🐛 Bug Report — TypeError: memory limit has been exceeded with inline `data:` URL #2998

agjohnson · 2024-10-24T20:17:17Z

What is happening

When serving an HTML file with a large (>10MB) inline data: URL resource, even a basic worker will halt the response and respond with:

TypeError: Parser error: The memory limit has been exceeded.

I thought this was simply the response size, but this is not the case.

If I send the file as text/plain, the file will send through the worker
If I generate a large HTML file with /dev/random, the file will send through the worker
The response truncates right at the point in the file that a data URL starts

So, it does seem to me to be very specific to inline data URLs.

Does this imply the worker fetch() is somehow inspecting the data URL during streaming? This seems odd, but if so, is there a missing option to disable this?

Reproducing

Run a minimal worker
Output an HTML file with a large inline data resource
Try to fetch the file through the worker

Example data URL HTML generator:

#!/bin/sh
content=$(dd if=/dev/random bs=1MB count=10 | base64 -w0)
echo "<html><head></head><body><img src='data:image/png;base64,$content' /></body></html>"

Generate the HTML file and serve it:

% ./gen.sh > 10mb.html
% python -m http.server 9000

And the worker:

export default {
  async fetch(request, env, ctx) {
    const response = await fetch("http://localhost:9000/10mb.html");
    return response;
  }
}

Run and try to fetch:

% wrangler dev test.js
% curl -o - http://localhost:8787 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
             Dload  Upload   Total   Spent    Left  Speed
100    25    0    25    0     0     46      0 --:--:-- --:--:-- --:--:--    46

Notice only 25 bytes sent, which is the start of the img element

% dd if=10mb.html bs=25 count=1
<html><head></head><body>

The text was updated successfully, but these errors were encountered:

jasnell · 2024-10-25T00:07:49Z

I'm unable to reproduce any failure here with workerd given the example included. The error that you mention ("Parser error: The memory limit has been exceeded") is an error that HTMLRewriter would throw (it comes from lol-html). Your example is not using HTMLRewriter, however so I'm a bit confused. Can you please clarify?

Also, are you seeing this error with a production worker (e.g. wrangler deploy, run the worker) or in local dev (e.g. wrangler dev), etc.

agjohnson · 2024-10-25T20:29:00Z

This has me incredibly confused as well, there is indeed no instance of HTMLRewriter. What I posted as the worker script is what I am testing with, in a fresh/empty directory too, and running the wrangler dev command I noted above.

The exception is from the html-rewriter code though:

✘ [ERROR] workerd/server/server.c++:3422: error: Uncaught exception: workerd/api/html-rewriter.c++:99: failed: remote.jsg.TypeError: Parser error: The memory limit has been exceeded.

A couple more pieces:

% npm list
@[email protected] /tmp/test
├── @cloudflare/[email protected]
├── [email protected]
├── [email protected]
├── [email protected]
└── [email protected]
% node --version
v18.7.0

Also, I notice that the first request will succeed with a bad response (no error but only 25 bytes sent) and the second request will fail with a bad response (error raised and only 25 bytes sent).

agjohnson · 2024-10-25T20:44:00Z

I can only follow along loosely with C++, but notice there is implicit parsing of data URLs outside the HTML rewriter:

workerd/src/workerd/api/data-url.c++

Line 17 in a9cf631

kj::Maybe<DataUrl> DataUrl::from(const jsg::Url& url) {

It seems like there is some processing of data URLs as sub requests inside a fetch() call?

jasnell · 2024-10-27T16:42:55Z

The DataUrl implementation is entirely separate from the HTMLRewriter and we do not automatically parse data urls that are embedded in a stream. The only place a data: URL is handled is if you are explicitly passing that to a fetch call, e.g. await fetch('data:...'). There's nowhere else in the code where it is used directly.

agjohnson · 2024-10-28T16:36:49Z

Yeah I saw that parsing of the fetch() URL was covered there, I didn't see any tests describing what I'm noticing in the worker example above though.

When you say you aren't able to reproduce this bug, what do you notice on the response? Are you getting 25 bytes back, or the full 10MB?

Would a docker compose or something help verify this bug? I could stumble through rebuilding workerd too, there might be some way to catch what is happening with the data URL? I was sure that the minimal repro steps here would surface this bug for you, but I'm rather confused all around now.

jasnell · 2024-10-28T23:58:57Z

When you say you aren't able to reproduce this bug, what do you notice on the response? Are you getting 25 bytes back, or the full 10MB?

Keeping in mind that I'm testing with workerd only (not wrangler dev) ... I get the full response.

agjohnson · 2024-10-29T00:06:56Z

I don't know enough about using workerd directly, if there is something I can run locally and confirm what you're seeing, let me know.

Is it possible this is specific to wrangler/workers-sdk otherwise?

jasnell · 2024-10-29T14:35:45Z

It's possible since in that path something may be injecting polyfills and maybe HTMLRewriter use into the worker. Take a look through the compiled source that the CLI generates and see if you can spot anywhere HTMLRewriter may be getting added.

agjohnson · 2024-10-29T18:43:34Z

Are you describing generating this with Wrangler? With something like:

% wrangler deploy --dry-run --outdir dist

When I do that, the bundled file looks like this:

// test.js
var test_default = {
  async fetch(request, env, ctx) {
    const response = await fetch("http://localhost:9000/10mb.html");
    return response;
  }
};
export {
  test_default as default
};
//# sourceMappingURL=test.js.map

- cloudflare/workerd#2998

agjohnson · 2024-10-29T21:28:15Z

I don't know exactly what to generate here, so here's a docker compose with the minimal reproduction. This is just the example above, but isolated via containers to prove that there is indeed no HTMLWriter defined at any point in the worker.

https://github.com/agjohnson/wrangler-large-data-url

Let me know if any part of that isn't clear. If this seems like a wrangler issue, could we transfer the issue?

agjohnson mentioned this issue Oct 24, 2024

Documentation page is blank readthedocs/readthedocs.org#11693

Closed

2 tasks

agjohnson added a commit to agjohnson/wrangler-large-data-url that referenced this issue Oct 29, 2024

Initial reproduction of bug

7faac91

- cloudflare/workerd#2998

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Bug Report — TypeError: memory limit has been exceeded with inline `data:` URL #2998

🐛 Bug Report — TypeError: memory limit has been exceeded with inline `data:` URL #2998

agjohnson commented Oct 24, 2024

jasnell commented Oct 25, 2024

agjohnson commented Oct 25, 2024 •

edited

Loading

agjohnson commented Oct 25, 2024

jasnell commented Oct 27, 2024

agjohnson commented Oct 28, 2024

jasnell commented Oct 28, 2024

agjohnson commented Oct 29, 2024

jasnell commented Oct 29, 2024

agjohnson commented Oct 29, 2024

agjohnson commented Oct 29, 2024

🐛 Bug Report — TypeError: memory limit has been exceeded with inline data: URL #2998

🐛 Bug Report — TypeError: memory limit has been exceeded with inline data: URL #2998

Comments

agjohnson commented Oct 24, 2024

What is happening

Reproducing

jasnell commented Oct 25, 2024

agjohnson commented Oct 25, 2024 • edited Loading

agjohnson commented Oct 25, 2024

jasnell commented Oct 27, 2024

agjohnson commented Oct 28, 2024

jasnell commented Oct 28, 2024

agjohnson commented Oct 29, 2024

jasnell commented Oct 29, 2024

agjohnson commented Oct 29, 2024

agjohnson commented Oct 29, 2024

🐛 Bug Report — TypeError: memory limit has been exceeded with inline `data:` URL #2998

🐛 Bug Report — TypeError: memory limit has been exceeded with inline `data:` URL #2998

agjohnson commented Oct 25, 2024 •

edited

Loading