Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Serialize normative html5 doctype or nothing #7793

Draft
wants to merge 2 commits into
base: trunk
Choose a base branch
from

Conversation

sirreal
Copy link
Member

@sirreal sirreal commented Nov 13, 2024

See this comment

//We ought to preserve the mode of the input document// but I don't think we should preserve the PUBLIC and SYSTEM identifiers. That is, only return one of two predefined DOCTYPE declarations.

If we have a standards document input, we return <!DOCTYPE html>. If we get a quirks-mode document input, we return something like <!DOCTYPE html PUBLIC "-//WordPress/HTML/quirks-mode"> to fully identify the document as such.

Trac ticket: https://core.trac.wordpress.org/ticket/62396


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Copy link
Member

@dmsnell dmsnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. A document without a doctype declaration is automatically in quirks mode?

⤷ Anything else
If the document is not an iframe srcdoc document, then this is a parse error; if the parser cannot change the mode flag is false, set the Document to quirks mode.

So if we ignore the token we get a quirks-mode document? But then we don’t positively identify it as such, we instead let it be without a doctype? Do you think there's still value in positively affirming that it’s in quirks mode? I’m thinking of systems that go on to do further work with the HTML

Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@sirreal
Copy link
Member Author

sirreal commented Nov 13, 2024

This change would by lossy regarding documents in limited-quirks mode. A case could be added for that:

The public identifier starts with: "-//W3C//DTD XHTML 1.0 Frameset//"
The public identifier starts with: "-//W3C//DTD XHTML 1.0 Transitional//"
The system identifier is not missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Frameset//"
The system identifier is not missing and the public identifier starts with: "-//W3C//DTD HTML 4.01 Transitional//"

@sirreal
Copy link
Member Author

sirreal commented Nov 13, 2024

So if we ignore the token we get a quirks-mode document?

That is my understanding. Processing a DOCTYPE token can result in quirks, no-quirks, or limited-quirks. The absence of the DOCTYPE proceeds through to subsequent insertion modes and sets the document mode to quirks. That logic is encoded in the HTML Processor here:

/*
* > A DOCTYPE token
*/
case 'html':
$doctype = $this->get_doctype_info();
if ( null !== $doctype && 'quirks' === $doctype->indicated_compatability_mode ) {
$this->compat_mode = WP_HTML_Tag_Processor::QUIRKS_MODE;
}
/*
* > Then, switch the insertion mode to "before html".
*/
$this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_BEFORE_HTML;
$this->insert_html_element( $this->state->current_token );
return true;
}
/*
* > Anything else
*/
initial_anything_else:
$this->compat_mode = WP_HTML_Tag_Processor::QUIRKS_MODE;
$this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_BEFORE_HTML;
return $this->step( self::REPROCESS_CURRENT_NODE );

@sirreal
Copy link
Member Author

sirreal commented Nov 13, 2024

Do you think there's still value in positively affirming that it’s in quirks mode?

This is the question 🙂

If we're going to have a quirks-mode DOCTYPE, I think it should either be omission or the (normalized) input DOCTYPE.

In the same way that <!DOCTYPE html> is the ideal no-quirks doctype, I feel that omission is the ideal quirks-mode DOCTYPE.

There remains the issue of limited-quirks DOCTYPEs, and those probably need to have their (normalized) DOCTYPES preserved given it's unclear what the ideal limited-quirks DOCTYPE would be.

I don't feel strongly about any of these, or even that this PR is a better approach. I'm open to being convinced.

DOCTYPEs do contain information that's accessible in the browsers. They can be normalized to a standard form, but I'm a bit reluctant to perform this "aggressive" normalization to quirks/limited-quirks/no-quirks DOCTYPEs.

@dmsnell
Copy link
Member

dmsnell commented Nov 14, 2024

DOCTYPEs do contain information that's accessible in the browsers. They can be normalized to a standard form, but I'm a bit reluctant to performa this "aggressive" normalization to quirks/limited-quirks/no-quirks DOCTYPEs.

they do, and this is one reason that I think it’s proper to normalize them as well. talking about predictability and improved reliability for unreliable legacy and new systems, it makes me wonder who we’re benefiting by passing along a dorktype.

do we know the effects of limited-quirks mode? so far we don’t even track that do we? I guess I’m comfortable leaving it a known potential bug until we understand the implications of it, but the fix is trivial — add that extra case.

omission is the ideal quirks-mode DOCTYPE.

for all fragment parsers this is the default behavior. we’re considering the wholesale rewrite of a full document in this patch. perhaps that’s good reason to remind ourselves that the impact of this existing behavior and change is minimal.

still, it worries me leaving things ambiguous, since the same HTML could appear as standards-mode in which case it would parse differently. there’d be no clue to the provenance of the document as a quirks- or limited-quirks document. this is also why I thought about using a new public identifier linking back to WordPress — any downstream system could see that and link back to the HTML API. “Oh! this is quirks-mode produced by WordPress, that’s why it’s this way…” Maybe my worries are overblown.

@sirreal
Copy link
Member Author

sirreal commented Nov 14, 2024

do we know the effects of limited-quirks mode? so far we don’t even track that do we?

The DOCTYPE parser does recognize limited-quirks doctypes correctly according to the spec. The HTML processor doesn't care about it because it does not affect parsing or tree construction, I believe it only affects rendering.

I believe limited quirks keeps some of the rendering peculiarities of quirks mode, but does not include things like case-insensitive class selectors or the P > TABLE tree construction difference that's in quirks mode.

Here's an example of a rendering and/or tree construction differences:

Limited quirks No quirks
limited-quirks no-quirks

Chrome, Firefox, and Safari all continue to support limited-quirks mode.

Wikipedia says this:

A third compatibility mode known as "limited quirks mode" (previously, "almost standards mode" or "strict mode"), which maintains the "traditional" vertical sizing of table cells according to the CSS2 specification, has been implemented in these browsers: Safari, Opera 7.5 (and later), all Gecko-based browsers since 1.0.1 (such as Firefox) and Internet Explorer 8.[2]

"Almost standards" mode rendering matches "standards" mode in all details except for one. The layout of images inside table cells is handled the same way "quirks" mode operates, instead, which is fairly consistent with legacy browsers such as Internet Explorer 7 (and earlier). This means that sliced-images-in-tables layouts are less likely to fall apart in browsers when in either "quirks" or "almost standards" mode, rather than "standards" mode.[5]

@sirreal
Copy link
Member Author

sirreal commented Nov 14, 2024

In my opinion, the safest, least surprising, and most obvious thing to do is serialize a normalized version of the DOCTYPE token as was done in #7780. This respects the input and maintains the no-/limited-/quirks mode of the document. I don't think we can go wrong preserving these aspects of the original document.

I could be convinced that replacing no-quirks doctypes with the standard <!DOCTYPE html> is reasonable and good. However, I'm less convinced that quirks mode doctypes should be completely removed or radically rewritten, but if we replace the no-quirks doctypes then it seems like this should be done as well 😕

Finally, I'm really not comfortable with changing a document's mode from limited-quirks, which probably means that limited-quirks must respect the original doctype, which leads me back to thinking that the original doctype should just be respected.

@dmsnell
Copy link
Member

dmsnell commented Nov 15, 2024

less convinced that quirks mode doctypes should be completely removed or radically rewritten

you’ve made a compelling case for preserving limited-quirks mode, though I’m just going to take you at your word that those screenshots are different — I don’t see it.

I’m still for radical rewriting though. In serialize() we also strip duplicate attributes, normalize attribute values, re-encode HTML text, drop ignored tokens, write tags that didn’t appear in the input, and perform all sorts of radical rewriting.

For the purpose of normalization and predictability, I still like the idea of pre-defined DOCTYPE tokens, just now with one known to create limited quirks. That breaks the reference to WordPress (since any new DOCTYPE will indicate quirks-mode) but it would also mean that all documents produced will have one of three very-easily searchable DOCTYPE declarations.

Again, this is an issue on the periphery I think, as most serializations will likely occur on fragments and most will be HTML5 (or should be). If not for the complexity involved I’d far prefer we produce HTML5 standards-mode documents for all inputs, but alas, that seems intractable (okay maybe I’ll be eating these words in a few months when we do it).

That means I will support your choice, but I think that preserving “dorktypes” is a misunderstanding of the goal of serialization() — it’s not about preserving but rather normalizing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants