
Implement more optimal scheduling for pages stream #5

Open
a1akris opened this issue Jan 19, 2024 · 5 comments
Labels: enhancement (New feature or request), next (Should be included in the next release), sched:December (Approx. month when the feature should land)

Comments

a1akris (Owner) commented Jan 19, 2024

Currently the pages stream is an equivalent of a hand-written loop that can look something like this:

let mut request = SomeRequest::default();

loop {
    let response = client.send(request).await?;
    
    process(response.extract_data());

    match response.token {
        Some(next_page_token) => { 
            request = SomeRequest::from(next_page_token);
        }
        None => break,
    } 
}

The loop above sends the request, awaits the response, and processes the results strictly sequentially. However, there is a more optimal way of doing the same thing which should be fairly easy to implement by modifying the current request_next_page impl: if a next page exists, we can send a request for it immediately, even before yielding the current response's results, and at the next stream evaluation we await the already progressing future. That way, awaiting the next response and processing the current results overlap, as in the sketch below.
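As a rough hand-written illustration of the idea (just a sketch, not what the stream impl will literally do; Client, SomeRequest, process, Error and the use of tokio::spawn are placeholders to keep the example self-contained), the loop could be restructured like this:

// Sketch only: kick off the request for the next page before processing
// the current one, so the fetch and the processing overlap.
async fn run(client: Client) -> Result<(), Error> {
    let first = {
        let client = client.clone();
        tokio::spawn(async move { client.send(SomeRequest::default()).await })
    };
    let mut in_flight = Some(first);

    while let Some(handle) = in_flight.take() {
        let response = handle.await.expect("fetch task panicked")?;

        // Schedule the request for the next page immediately...
        if let Some(next_page_token) = response.token.clone() {
            let client = client.clone();
            in_flight = Some(tokio::spawn(async move {
                client.send(SomeRequest::from(next_page_token)).await
            }));
        }

        // ...and only then process the current results, while the next
        // request is already making progress on the runtime.
        process(response.extract_data());
    }

    Ok(())
}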

Here is a list of primitives which can help to achieve the desired behavior:

StarlessNights commented:

This would make sense especially when iterating on the items() stream (which I assume to be a common use case). From that perspective it seems odd that there are these special elements (at page ends), after which getting the next one takes considerably longer (fetching the next page).

It's a bit unclear from the documentation whether pages_ahead fares any better in this regard. Does that just greedily try to get all the pages regardless of how many have been yielded?

PS. This is a nice project! There's definitely use for this kinda crate.

a1akris (Owner, Author) commented Oct 26, 2024

> From that perspective it seems odd that there are these special elements (at page ends), after which getting the next one takes considerably longer (fetching the next page).

Yes, this is exactly what I'm trying to solve in this issue. The new implementation should reduce (but not eliminate) fetch delays at page boundaries.

> It's a bit unclear from the documentation whether pages_ahead fares any better in this regard. Does that just greedily try to get all the pages regardless of how many have been yielded?

Yes, certainly, and pages_ahead will remain more optimal even compared to the next implementation of pages. Unlike pages, pages_ahead fetches the first requests_ahead_count pages in parallel, and as soon as it receives any page it immediately starts fetching the next one. In other words, while you're processing fetched items there are always requests_ahead_count fetch tasks running in the background, which makes the delays a lot smoother.

In general, for non-cursor pagination patterns you should always prefer pages_ahead/pages_ahead_unordered streams.
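For intuition, a rough hand-written equivalent of this bounded prefetching could look like the sketch below (again, just an illustration of the technique, not this crate's actual code; Client, Page, fetch_page, process_page and PAGES_AHEAD are made-up placeholders):

use futures::stream::{FuturesOrdered, StreamExt};

const PAGES_AHEAD: usize = 4; // plays the role of requests_ahead_count

async fn run(client: &Client, total_pages: usize) {
    let mut in_flight = FuturesOrdered::new();
    let mut next_to_schedule = 0;

    // Seed the first PAGES_AHEAD requests up front.
    while next_to_schedule < PAGES_AHEAD.min(total_pages) {
        in_flight.push_back(fetch_page(client, next_to_schedule));
        next_to_schedule += 1;
    }

    // Each time a page arrives, immediately schedule one more request, so up to
    // PAGES_AHEAD fetches stay in flight while the pages are being processed.
    while let Some(page) = in_flight.next().await {
        if next_to_schedule < total_pages {
            in_flight.push_back(fetch_page(client, next_to_schedule));
            next_to_schedule += 1;
        }
        process_page(page);
    }
}

// Hypothetical helper that fetches a single page by index.
async fn fetch_page(client: &Client, page_index: usize) -> Page {
    client.get_page(page_index).await
}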

> PS. This is a nice project! There's definitely use for this kinda crate.

Thank you! I'm already at the point where I cannot imagine how to work with AWS/Azure APIs or scrape sites without the help of this crate. It just reduces all the mess into a bunch of combinators or into a simple while loop :)

a1akris added the enhancement label Oct 26, 2024
a1akris mentioned this issue Oct 27, 2024 (7 tasks)
a1akris added the sched:December and next labels Oct 27, 2024
StarlessNights commented:

Thanks for the explanation @a1akris.

> Yes, certainly, and pages_ahead will remain more optimal even compared to the next implementation of pages. Unlike pages, pages_ahead fetches the first requests_ahead_count pages in parallel, and as soon as it receives any page it immediately starts fetching the next one. In other words, while you're processing fetched items there are always requests_ahead_count fetch tasks running in the background, which makes the delays a lot smoother.

The one thing I'm a bit worried about with this is memory usage. If the processing of elements is slower than the page fetching and the iterated collection is huge, I'll end up with an unbounded number of pages in memory? If this is correct then I'd feel better about having an argument for throttling the page fetching. E.g. something like ready_pages_limit.

a1akris (Owner, Author) commented Oct 29, 2024

Good question! The argument you're referring to already exists: requests_ahead_count. I mentioned that fetching happens in the background in parallel, which is true because pages_ahead_* streams schedule multiple futures that query the corresponding pages at the same time. However, the scheduling itself does not happen in the background! The scheduling logic triggers only when the .next() method is called, so asymmetries between processing and fetching speeds don't matter!

Example

let mut items = std::pin::pin!(client.pages_ahead(4, req, Limit::None));


// Here the stream schedules 4 (requests_ahead_count) futures,
// and as soon as the first one completes it schedules the 5th request
// and returns you the first page to process.
let page1 = items.next().await;

// While we're here, no extra futures can be scheduled in the background, 
// but futures that have already been scheduled are executing.
process_page(page1); // sleep for 300 seconds

// The 2nd request has certainly completed within those 300 seconds, so you
// get the second page immediately, but before yielding it the stream
// schedules the 6th request.
let page2 = items.next().await;

// Get 3rd page, schedule 7th request.
let page3 = items.next().await;

// and so on...
let page4 = process_page(items.next().await);

TL;DR

If turn_page allocates M bytes of memory, then pages_ahead_* streams allocate (N + 1) * M + C bytes of memory, where N is requests_ahead_count and C is the size of the scheduler state, which is allocated only once upon stream creation.
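To put rough numbers on it: with requests_ahead_count = 4 and pages of about 1 MiB each, that's on the order of 5 MiB of page buffers plus the fixed scheduler state.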

StarlessNights commented:

Thanks again for the detailed explanation! Ya that makes sense. Perhaps a similar note about memory usage could be added to the docs too?
