fix(parse/html): handle unclosed elements more gracefully #5063

dyc3 · 2025-02-08T20:00:47Z

Summary

This PR lets the HTML parser handle unclosed elements instead of failing to parse outright. It attempts to resolve the SyntaxError failures from the Prettier tests in #5056.

Test Plan

The related prettier tests parse successfully, and the existing tests emit diagnostics.

ematipico

Sorry to interject on the PR, but this "fix" isn't obvious because no tests changed other than the parser snapshots, which is just a consequence of the new grammar.

For example, if the closing element is optional, the parser shouldn't emit diagnostics; instead, the snapshots are still the same.

The formatting "bug" hasn't changed. I would expect something different from the snapshots, or at least we should have a new one to cover future regressions.

ematipico · 2025-02-08T20:16:43Z

xtask/codegen/html.ungram

@@ -84,7 +84,7 @@ HtmlSelfClosingElement =
 HtmlElement =
 	opening_element: HtmlOpeningElement
 	children: HtmlElementList
-	closing_element: HtmlClosingElement
+	closing_element: HtmlClosingElement?


This isn't correct though: by spec, a closing element should always be there. If we need to handle some error case, we should use bogus nodes instead.

The formatter needs to be able to add closing tags to unclosed elements. See prettier's output: https://biomejs.dev/playground/?lintRules=all&files.main.html=PABkAGkAdgA%2BAA%3D%3D

The reason I did it like this is that bogus nodes are not structured like normal nodes, so it would be harder to extract the tag name in order to add the closing tag. HTML_BOGUS_ELEMENT doesn't necessarily have an HTML_OPENING_ELEMENT when it occurs.
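To illustrate the argument above, here is a minimal Python sketch (not Biome's actual Rust code) of an element node whose closing element is optional. `Element` and `format_element` are hypothetical names invented for this example. The point is only that a structured node keeps the opening tag name available, so a formatter can emit a closing tag even when the source omitted it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Element:
    """Hypothetical stand-in for an element node with an optional closing tag."""
    tag_name: str  # taken from the opening element
    children: List[Union["Element", str]] = field(default_factory=list)
    closing_tag: Optional[str] = None  # None when the source omitted it


def format_element(node: Union[Element, str]) -> str:
    """Print a node, always emitting a closing tag derived from the opening tag."""
    if isinstance(node, str):
        return node
    inner = "".join(format_element(child) for child in node.children)
    # The closing tag is rebuilt from the opening tag name, so it does not
    # matter whether the parser actually saw one in the source.
    return f"<{node.tag_name}>{inner}</{node.tag_name}>"


unclosed = Element("div", children=["hello"])
print(format_element(unclosed))
```

With a bogus node that has no guaranteed opening element, there would be no reliable `tag_name` to build the closing tag from, which is the difficulty described above.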

(Also, fun fact, the HTML spec does allow some elements to omit their closing tag, like <tr> and <td>. Prettier's parser doesn't handle that though. See the code examples here: https://html.spec.whatwg.org/multipage/tables.html#the-table-element)

I'll look into it some more and add some tests.

I suppose this comes from the fact that browsers can "fix" the HTML, which means that something like the following snippet is valid in the browser.

<div><div><div><div><div><span><div><td><tr>

However, the W3C validator emits errors if the closing element is missing 🤔

I know that Astro parses the HTML as if it were the browser, so it patches it during compilation. I suppose that makes sense for a compiler, but I'm not sure it makes sense for a formatter.

I am torn about the change. It's also worth noting that Prettier uses a fork of the Angular HTML parser, so we should expect that the parser was made for Angular in the first place.

Maybe we could evaluate some options for HTML parsing (a strict one, where closing elements are mandatory, and a loose one; we can discuss it later). If you want to move forward with this change, that's fine. However, we need to change the parsing logic so that it doesn't emit a diagnostic if the closing element is missing.

Actually, as far as I'm aware, the "auto fixing" that browsers do is just a thing browsers do and not actually specified in the HTML spec. The behavior I was talking about is the one where, if you give it <td> foo <td> bar, it will actually result in <td> foo </td> <td> bar </td> (2 sibling tags) and not <td> foo <td> bar </td> </td> (where the first is the parent of the second). But I digress.
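The sibling-versus-nested distinction can be sketched with a tiny tree builder in Python. This is an illustration only: `Node`, `IMPLIED_END`, `build_tree`, and `shape` are hypothetical names, the rule table is deliberately incomplete, and the input is a list of opening tag names rather than real HTML.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical, deliberately tiny rule table: opening the key tag implicitly
# closes a currently open element whose name is in the value set.
IMPLIED_END: Dict[str, Set[str]] = {
    "td": {"td"},
    "tr": {"tr", "td"},
}


def build_tree(open_tags: List[str]) -> Node:
    """Build a tree from a sequence of opening tags only (no closing tags)."""
    root = Node("#root")
    stack = [root]
    for name in open_tags:
        # Pop open elements that the new tag implicitly closes.
        while stack[-1].name in IMPLIED_END.get(name, set()):
            stack.pop()
        node = Node(name)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def shape(node: Node) -> str:
    """Render the tree structure as a compact string."""
    if not node.children:
        return node.name
    return f"{node.name}({', '.join(shape(c) for c in node.children)})"


print(shape(build_tree(["td", "td"])))    # the two td elements become siblings
print(shape(build_tree(["div", "div"])))  # the second div nests inside the first
```

Tags without an entry in the table, such as div, simply nest, which is why an unclosed div still needs either a diagnostic or a synthesized closing tag.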

Would it make sense to have a new node defined like this?

HtmlElementUnclosed =
	opening_element: HtmlOpeningElement
	children: HtmlElementList

I attempted this in another branch and encountered some difficulties with the parser assigning the closing tag to the wrong element in cases like this:

<div> <span> </div>

The parser would assign the </div> to the <span> instead of the <div>, so the div became the unclosed element in the AST rather than the span.
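One possible way to avoid that misassignment is to match each closing tag against a stack of open elements by name. The Python sketch below shows the idea under simplifying assumptions: `find_unclosed` and the `(kind, name)` token shape are hypothetical, and this is not the approach taken in this PR or in Biome's parser.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # ("open" | "close", tag_name)


def find_unclosed(tokens: List[Token]) -> List[str]:
    """Return the names of elements that never received their own closing tag."""
    stack: List[str] = []
    unclosed: List[str] = []
    for kind, name in tokens:
        if kind == "open":
            stack.append(name)
        elif name in stack:
            # Close everything above the nearest matching opening tag; those
            # inner elements are the ones missing a closing tag.
            while stack[-1] != name:
                unclosed.append(stack.pop())
            stack.pop()
        # A closing tag with no matching open element is ignored in this sketch.
    # Anything still open at the end of the input is unclosed too.
    unclosed.extend(reversed(stack))
    return unclosed


# <div> <span> </div>
print(find_unclosed([("open", "div"), ("open", "span"), ("close", "div")]))
```

Here the </div> is paired with the <div> because the names match, and the span in between is reported as the unclosed element.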

codspeed-hq · 2025-02-08T20:46:54Z

CodSpeed Performance Report

Merging #5063 will not alter performance

Comparing html-unclosed-2 (0a105b9) with next (de27f6f)

Summary

✅ 94 untouched benchmarks

github-actions bot added the A-Parser (Area: parser), A-Formatter (Area: formatter), A-Tooling (Area: internal tools), and L-HTML (Language: HTML) labels Feb 8, 2025

fix(parse/html): handle unclosed elements more gracefully

0a105b9

dyc3 force-pushed the html-unclosed-2 branch from f1e5b20 to 0a105b9 Compare February 8, 2025 20:01

dyc3 requested review from a team February 8, 2025 20:03

dyc3 marked this pull request as ready for review February 8, 2025 20:03

ematipico reviewed Feb 8, 2025


Base automatically changed from next to main February 12, 2025 11:41
