Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(parse/html): handle unclosed elements more gracefully #5063

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dyc3
Copy link
Contributor

@dyc3 dyc3 commented Feb 8, 2025

Summary

This PR makes it so the HTML parser can handle unclosed elements without failing to parse outright. It attempts to resolve the SyntaxError failures from the prettier tests in #5056.

Test Plan

The related prettier tests parse successfully, and the existing tests emit diagnostics.

@github-actions github-actions bot added A-Parser Area: parser A-Formatter Area: formatter A-Tooling Area: internal tools L-HTML Language: HTML labels Feb 8, 2025
@dyc3 dyc3 requested review from a team February 8, 2025 20:03
@dyc3 dyc3 marked this pull request as ready for review February 8, 2025 20:03
Copy link
Member

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to interject the PR, but this "fix" isn't obvious because no tests changed, other than the parser snapshots which is just a consequence of the new grammar.

For example, if the closing element is optional, it shouldn't emit diagnostics, instead the snapshots are still the same.

The formatting "bug" hasn't changed, I would expect something different from the snapshots, or at least we should have new one to cover future regressions.

@@ -84,7 +84,7 @@ HtmlSelfClosingElement =
HtmlElement =
opening_element: HtmlOpeningElement
children: HtmlElementList
closing_element: HtmlClosingElement
closing_element: HtmlClosingElement?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct though, by spec a closing element should always be there. If we require handling some error case, we should use bogus nodes instead.

Copy link
Contributor Author

@dyc3 dyc3 Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatter needs to be able to add closing tags to unclosed elements. See prettier's output: https://biomejs.dev/playground/?lintRules=all&files.main.html=PABkAGkAdgA%2BAA%3D%3D

The reason I did it like this is that that bogus nodes are not structured like normal nodes, and so it would be harder to extract the tag name so that the closing tag can be added. HTML_BOGUS_ELEMENT doesn't necessarily have a HTML_OPENING_ELEMENT when it occurs.

(Also, fun fact, the HTML spec does allow some elements to omit their closing tag, like <tr> and <td>. Prettier's parser doesn't handle that though. See the code examples here: https://html.spec.whatwg.org/multipage/tables.html#the-table-element)

I'll look into it some more and add some tests.

Copy link
Member

@ematipico ematipico Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this comes from the fact that browsers can "fix" the HTML, this means that having something like the following snippet in valid in the browser.

<div><div><div><div><div><span><div><td><tr>

However, the w3c validator emits errors if the closing element is missing 🤔

I know that Astro parses the HTML as it was the browser, so it patches it during the compilation. I suppose it makes sense for a compiler, but I'm not sure it makes sense for a formatter.

I am torn about the change. There's also to note that Prettier uses a fork of the angular HTML parser, so we should expect that the parser is made for angular in the first place.

Maybe we could evaluate some options for HTML parsing (a strict one, where opening elements are mandatory, and a loose one; we can discuss it later). If you want to move forward with this change, that's fine. However, we need to change the parsing logic and not emit a diagnostic if the closing element is missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, as far as I'm aware, the "auto fixing" that browsers do is just a thing browsers do and not actually specified in the html spec. The behavior I was talking about is the one that where if you give it <td> foo <td> bar it will actually result in <td> foo </td> <td> bar </td> (2 sibling tags) and not <td> foo <td> bar </td> </td> (where the first is the parent of the second). But I digress.

Would it make sense to have a new node defined like this?

HtmlElementUnclosed = 
	opening_element: HtmlOpeningElement
	children: HtmlElementList

I attempted this in another branch and encountered some difficulties with the parser assigning the closing tag to the wrong element in cases like this:

<div>
   <span>
</div>

Where it would assign the </div> to the <span> instead of the div, resulting in div becoming the unclosed element in the AST rather than the span.

Copy link

codspeed-hq bot commented Feb 8, 2025

CodSpeed Performance Report

Merging #5063 will not alter performance

Comparing html-unclosed-2 (0a105b9) with next (de27f6f)

Summary

✅ 94 untouched benchmarks

Base automatically changed from next to main February 12, 2025 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Formatter Area: formatter A-Parser Area: parser A-Tooling Area: internal tools L-HTML Language: HTML
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants