Add a "streaming API" for incoming frames #296

cjerdonek · 2017-10-27T23:09:24Z

I'm wondering about the use case of transferring potentially large files over a websocket, where in the end the file would get written to the file system.

It seems the only way to do this with websockets's current API is to call recv() repeatedly and write each resulting (large-ish) bytestring, of size up to max_size, to the file system. I'm wondering if this approach seems perfectly fine, or if it would make sense for websockets to expose some kind of streaming API to bypass the creation of the intermediate bytestrings. This would be analogous to requests's "streaming" mode.


aaugustin · 2017-10-30T21:47:16Z

Take 1

I'm not sure how well that use case fits the websocket protocol.

HTTP was designed to transfer documents, that is, HTML files, and is widely used to transfer files. Fetching a file and storing it locally is a fairly reasonable use case.

If you're transferring a single file over the lifetime of a websocket connection, you could just as well use an HTTP GET (for download) or POST (for upload).

If you're transferring multiple files, you're going to need a way to carry metadata across the wire and to delimit file transfers.

The best fit for websocket I can think of is appending some data to a file — i.e. writing logs. In that case, writing a loop that reads messages and writes them seems reasonable.
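A rough sketch of such a loop, for illustration only. It assumes nothing about the connection beyond an awaitable recv() that returns str or bytes; the function name and the stop_after parameter are hypothetical and not part of websockets.

```python
async def append_messages(connection, path, stop_after=None):
    """Append each incoming message to ``path``; return how many were written.

    ``connection`` is any object with an awaitable ``recv()``.
    ``stop_after`` is a hypothetical knob that lets the loop end on its own.
    """
    count = 0
    with open(path, "ab") as log:
        while stop_after is None or count < stop_after:
            message = await connection.recv()
            if isinstance(message, str):
                # Text messages arrive as str; encode them before writing.
                message = message.encode("utf-8")
            # Note: this is a blocking write inside a coroutine.
            log.write(message)
            count += 1
    return count
```

With a real connection and stop_after left as None, the loop would normally end when recv() raises ConnectionClosed. Each message is still held in memory in full before it is written, which is the limitation this issue is about.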

Take 2

The websocket protocol provides a way to split a large message across multiple frames: fragmentation.

Currently websockets reassembles fragmented messages, but the resulting message must still be smaller than max_size and fit in memory.

I'm not coming up with an actionable way to frame this discussion... to be continued!

cjerdonek · 2017-10-31T00:16:17Z

Thanks for your thoughts. The use case I had in mind was:

the websocket server needs to receive a file from a client,
a websocket connection is already established between the client and server,
the file can be larger than max_size and so would require more than one message,
if POST were used, it may require chunking into multiple POST requests (but maybe not).

Since a websocket connection was already established, I was thinking it would be easiest to re-use that connection.

Your suggestion in "Take 1" is to receive the files "out of band." It seems that suggestion would require either running another server on a different port, as you suggest here, or else processing the POST "manually" in websockets' HTTP hook, which might be a pain (I'm not sure yet).

Independent of the question of file uploads, I think it would still be worth discussing whether an API that exposes a stream instead of returning a bytes value would be useful. Maybe there are reasons why such an approach wouldn't end up saving anything significant, or maybe not.

RemiCardona · 2017-10-31T10:32:43Z

I'll just add here that plain old HTTP already offers plenty of features aimed squarely at file transfer:

byte range requests
checksumming
on-disk filename preservation (which can be different from the URL)
multi-part/chunked streaming (useful for data streams where the total size cannot be known in advance)
caching
and plenty more that are lesser known features

The beauty of websockets is that you already have HTTP. So really, as far as file transfer goes, I would really advise against reimplementing using websockets what HTTP already offers.

Now about the streaming API you mention, I think it may be overkill. Not only that, but browsers (for which WebSockets was created in the first place) don't have any sort of streaming API: plain method calls for send() and close(), event callbacks for onmessage reception. Streaming just isn't what WebSockets was designed for.

Cheers

cjerdonek · 2017-10-31T11:20:34Z

Streaming just isn't what WebSockets was designed for.

I think you may be misunderstanding what I'm suggesting. I'm not suggesting "streaming" in the sense of a use case. I'm suggesting the idea of the library's Protocol object returning a file-like object instead of a bytestring, e.g. for cases where the message will be written to the file system. The idea is that on the server-side, this could perhaps reduce memory usage in cases where the server is handling many messages.

Here is one example in the code where a bytestring is being created in memory and could perhaps benefit from a file-like object:

https://github.com/aaugustin/websockets/blob/27549c4b390443b7504e937d4d974bd0855b4c7f/websockets/protocol.py#L584
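To make the "file-like object instead of a bytestring" idea concrete, here is a minimal sketch using only the standard library. It is not how websockets works; collect_fragments and memory_limit are hypothetical names. The fragments are written into a tempfile.SpooledTemporaryFile, which keeps data in memory up to a threshold and then spills to disk, rather than being joined into one bytes object.

```python
import tempfile


def collect_fragments(fragments, memory_limit=1024 * 1024):
    """Write ``fragments`` into a file-like object and return it rewound.

    Data stays in memory until it exceeds ``memory_limit`` bytes,
    after which the spooled file rolls over to a real temporary file.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=memory_limit)
    for fragment in fragments:
        spool.write(fragment)
    spool.seek(0)  # let the caller read from the beginning
    return spool
```

The caller would then read from the returned object in pieces (or copy it to its destination with shutil.copyfileobj) and close it when done.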

The beauty of websockets is that you already have HTTP.

Actually, if we're speaking of the websockets package, you don't really. For example, from websockets' documentation:

For the sake of simplicity, [the websockets package] doesn’t rely on a full HTTP implementation. Its support for HTTP responses is very limited.

So, many of the features you have in mind likely aren't present in the library. What's driving this issue in part is the possibility of doing simple file transfers within the websockets package without having to add the complexity of a full-blown, heavy-weight HTTP server.

cjerdonek · 2017-10-31T21:00:07Z

In the "send" direction, one probably relatively easy thing to do would be to update the WebSocketCommonProtocol.send() method to accept not just bytes objects for binary data, but also "bytes-like" objects (like memoryview).
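As an illustration of what bytes-like support buys on the sending side, here is a small sketch that slices a buffer into chunks through memoryview, so that no chunk is a copy of the underlying data. The helper iter_chunks is hypothetical; whether send() accepts the resulting views depends on the change proposed here being made.

```python
def iter_chunks(data, chunk_size):
    """Yield successive ``chunk_size``-byte views of ``data`` without copying."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # A flat unsigned-byte view over the original buffer.
    view = memoryview(data).cast("B")
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview creates another view, not a new bytes object.
        yield view[start:start + chunk_size]
```

Slicing a bytes object the same way would allocate a new bytes object per chunk, which is the overhead a bytes-like API avoids.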

aaugustin · 2017-11-01T20:37:11Z

Yes, send() should accept bytes-like objects rather than just bytes. It should also provide some support for fragmenting outgoing messages (#258) but I'm not yet sure what the API for that should look like.

The line you're quoting above is in charge of reassembling fragmented messages. This is separate from the premise of this discussion, which is about "assembling frames".

However I think that's the right level to discuss this. Let's not invent an additional fragmentation mechanism over multiple messages, there's already one over multiple frames.

I'm interested in investigating smarter ways to handle reassembly of fragmented messages. In fact the RFC hints at this possibility:

IMPLEMENTATION NOTE: In the absence of any extension, a receiver
doesn't have to buffer the whole frame in order to process it. For
example, if a streaming API is used, a part of a frame can be
delivered to the application. However, note that this assumption
might not hold true for all future WebSocket extensions.

That could be quite hard to fit into the current architecture, though.

cjerdonek · 2017-11-01T20:57:49Z

Yes, I wasn't suggesting adding anything to assist with multiple messages.

I do need to familiarize myself with fragmented messages, though. But either way, wouldn't that line be affected by a "receive" API capable of returning a bytes-like object -- the idea being that you wouldn't need to join individual byte strings to create a larger one if you're dealing with bytes-like objects?

aaugustin · 2017-11-01T21:15:58Z

Yes, that line would need to change if we provided a streaming API for incoming fragmented messages.

Fragmentation in WebSockets is pretty simple:

cut a message in several parts
send each part in its own frame
set the opcode to OP_CONT on all frames except the first one (which keeps OP_TEXT or OP_BINARY)
set the FIN bit only on the last frame

The non-fragmented case follows the same rules; there's only one frame which is both the first and the last one.
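The rules above can be sketched in a few lines. This is an illustration, not code from websockets: the fragment function and the (fin, opcode, data) tuple layout are made up for the example, while the opcode values are the ones defined in RFC 6455.

```python
OP_CONT, OP_TEXT, OP_BINARY = 0x00, 0x01, 0x02  # opcode values from RFC 6455


def fragment(opcode, payload, max_fragment_size):
    """Split ``payload`` into a list of ``(fin, opcode, data)`` tuples."""
    if max_fragment_size <= 0:
        raise ValueError("max_fragment_size must be positive")
    # Cut the message into parts; an empty message still needs one frame.
    parts = [
        payload[i:i + max_fragment_size]
        for i in range(0, len(payload), max_fragment_size)
    ] or [payload]
    frames = []
    for index, part in enumerate(parts):
        fin = index == len(parts) - 1            # FIN only on the last frame
        op = opcode if index == 0 else OP_CONT   # first frame keeps its opcode
        frames.append((fin, op, part))
    return frames
```

A message that fits in one fragment comes out as a single frame with the FIN bit set and its original opcode, which matches the non-fragmented case.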

cjerdonek · 2017-11-02T07:23:15Z

The line you're quoting above is in charge of reassembling fragmented messages. This is separate from the premise of this discussion, which is about "assembling frames".

By the way, I could still be confused about what you have in mind because there is a bit of ambiguity in the phrase "assembling frames." It can be interpreted to mean either "assembling multiple frames to form a single message" or "assembling frames [from their parts]." (The latter interpretation could be what the implementation note is getting at where it refers to partial frames: "a receiver doesn't have to buffer the whole frame in order to process it.")

What also makes it confusing is that the phrase "fragmented messages" has a similar ambiguity. It can be interpreted to mean either a single message fragmented into multiple frames, or an end-user dividing a single "message" (in the broad sense of the word) into multiple websocket messages. (The latter is what I was agreeing this issue shouldn't be about.)

If I'm interpreting things correctly, the idea would be to possibly expose each frame as a bytes-like object, and also expose each message as a bytes-like object (which internally could be implemented by accessing the bytes-like objects of each frame in sequence). Also, with this approach, I believe max_size would play less of a role (or at least a different role) because the API wouldn't be exposing the message in its entirety. It would just be exposing the stream so the entire message wouldn't necessarily all need to be in memory.

aaugustin · 2017-11-02T19:51:58Z

Ugh, I can't make sense of what I wrote yesterday, I must have swapped some words, sorry :-(

Let me try again:

GOOD: multiple frames = 1 message. Improve websockets' support for handling fragmented messages on the way in (e.g. by providing the option not to reassemble in memory) and on the way out (currently there's nothing)
BAD: multiple messages = 1 file (or any other large content). Invent our own thing to reassemble multiple messages.

Your interpretation is correct anyway.

Since max_size is intended as a limit on the amount of data in memory at a given time, if we're building a streaming API, we could:

reinterpret it as the limit on each chunk / frame
introduce a separate limit

There are pros and cons to each approach.
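As a sketch of the first option (the limit applies to each chunk rather than to the whole message), here is a generator that hands out the frames of one message one at a time without joining them. Everything here is hypothetical: the function name, the (fin, opcode, data) tuples and the max_chunk_size parameter are assumptions for illustration, and control frames interleaved between fragments are ignored.

```python
OP_CONT = 0x00  # continuation opcode from RFC 6455


def iter_message_chunks(frames, max_chunk_size):
    """Yield the payload of each frame of one message, one frame at a time.

    ``frames`` is an iterator of ``(fin, opcode, data)`` tuples.
    ``max_chunk_size`` plays the role of max_size, applied per frame.
    """
    first = True
    for fin, opcode, data in frames:
        if first and opcode == OP_CONT:
            raise ValueError("message cannot start with a continuation frame")
        if not first and opcode != OP_CONT:
            raise ValueError("expected a continuation frame")
        if len(data) > max_chunk_size:
            raise ValueError("frame exceeds the per-chunk limit")
        yield data
        if fin:
            return  # last frame of this message; leave later frames unread
        first = False
    raise ValueError("frame stream ended before the final fragment")
```

A consumer could write each chunk straight to a file, so memory use is bounded by max_chunk_size rather than by the size of the message. With the second option, a separate parameter would cap the total instead.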

cjerdonek · 2017-11-02T20:21:29Z

reinterpret it as the limit on each chunk / frame

Right, that's one of the approaches that occurred to me, too.

cjerdonek · 2017-11-02T20:34:18Z

Also, thanks for clarifying.

GOOD: multiple frames = 1 message. Improve websockets' support for handling fragmented messages on the way in (e.g. by providing the option not to reassemble in memory) ...

I just want to clarify / add that the "IMPLEMENTATION NOTE" you quoted above suggests this can be taken even a step further, namely by handling "partial frames" (i.e. as a frame is coming in). This is different in that it would also affect cases where the message isn't fragmented but is coming in as a single frame. So even the individual frame itself wouldn't need to be assembled in memory (if I'm interpreting that portion of the RFC correctly)...

cjerdonek · 2017-11-02T23:06:22Z

By the way, this recent thread (Oct. 18 with subject "APIs for high-bandwidth large I/O?") on the async-sig list might be of interest:
https://mail.python.org/pipermail/async-sig/2017-October/000392.html
The use case might not match exactly, but some ideas could be relevant.

aaugustin · 2018-05-06T08:21:44Z

Changing the title to reflect where the discussion took us.

aaugustin · 2018-09-23T16:16:02Z

The discussion here got quite long. In order to make it easier to move forwards, I split it into smaller issues.

Fragmentation of outgoing frames: Provide control over fragmentation of outgoing frames #258 and Support async iterators for fragmenting outgoing frames #477
Accepting bytes-like objects: Accept bytes-like objects #478
Non-reassembly of fragmented incoming frames: Support receiving fragmented messages without reassembly #479

I left aside partial read of frames (the IMPLEMENTATION NOTE I quoted from the RFC) because partial reads don't feel natural in asyncio and because it wouldn't work well with extensions. It would be quite complicated to support well. Until someone comes up with a compelling use case, I'm saying no.

I hope I didn't miss anything major. If I did, let's open additional issues.

aaugustin added the enhancement label Nov 1, 2017

aaugustin changed the title ~~writing large files efficiently?~~ Add a "streaming API" for incoming frames May 6, 2018

aaugustin added this to the someday milestone May 14, 2018

This was referenced Sep 23, 2018

Accept bytes-like objects #478

Closed

Support receiving fragmented messages without reassembly #479

Closed

aaugustin closed this as completed Sep 23, 2018
