
Implement more optimal scheduling for pages stream #5

Open
a1akris opened this issue Jan 19, 2024 · 5 comments
Labels: enhancement (New feature or request), next (Should be included in the next release), sched:December (Approx. month when the feature should land)

Comments

a1akris (Owner) commented Jan 19, 2024

Currently the pages stream is an equivalent of a hand-written loop that can look something like this:

let mut request = SomeRequest::default();

loop {
    let response = client.send(request).await?;
    
    process(response.extract_data());

    match response.token {
        Some(next_page_token) => { 
            request = SomeRequest::from(next_page_token);
        }
        None => break,
    } 
}

The loop above sends the request, awaits the response, and processes the results strictly sequentially. However, there is a more optimal way of doing the same thing which should be fairly easy to implement by modifying the current request_next_page impl: if a next page exists, we can send a request for it immediately, even before yielding the current response's results, and at the next stream evaluation we await the already progressing future. That way, awaiting the next response and processing the current results overlap, as in the sketch below.
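As a rough hand-written illustration of the idea (just a sketch, not what the stream impl will literally do; Client, SomeRequest, process, Error and the use of tokio::spawn are placeholders to keep the example self-contained), the loop could be restructured like this:

// Sketch only: kick off the request for the next page before processing
// the current one, so the fetch and the processing overlap.
async fn run(client: Client) -> Result<(), Error> {
    let first = {
        let client = client.clone();
        tokio::spawn(async move { client.send(SomeRequest::default()).await })
    };
    let mut in_flight = Some(first);

    while let Some(handle) = in_flight.take() {
        let response = handle.await.expect("fetch task panicked")?;

        // Schedule the request for the next page immediately...
        if let Some(next_page_token) = response.token.clone() {
            let client = client.clone();
            in_flight = Some(tokio::spawn(async move {
                client.send(SomeRequest::from(next_page_token)).await
            }));
        }

        // ...and only then process the current results, while the next
        // request is already making progress on the runtime.
        process(response.extract_data());
    }

    Ok(())
}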

Here is a list of primitives which can help to achieve the desired behavior:

StarlessNights commented:

This would make sense especially when iterating on the items() stream (which I assume to be a common use case). From that perspective it seems odd that there are these special elements (at page ends), after which getting the next one takes considerably longer (fetching the next page).

It's a bit unclear from the documentation whether pages_ahead fares any better in this regard. Does that just greedily try to get all the pages regardless of how many have been yielded?

PS. This is a nice project! There's definitely use for this kinda crate.

a1akris (Owner, Author) commented Oct 26, 2024

> From that perspective it seems odd that there are these special elements (at page ends), after which getting the next one takes considerably longer (fetching the next page).

Yes, this is exactly what I'm trying to solve in this issue. The new implementation should reduce (but not eliminate) fetch delays at page boundaries.

> It's a bit unclear from the documentation whether pages_ahead fares any better in this regard. Does that just greedily try to get all the pages regardless of how many have been yielded?

Yes, certainly, and pages_ahead will remain more optimal even compared to the next implementation of pages. Unlike pages, pages_ahead fetches the first requests_ahead_count pages in parallel, and as soon as it receives any page it immediately starts fetching the next one. In other words, while you're processing fetched items there are always requests_ahead_count fetch tasks running in the background, which makes the delays a lot smoother.

In general, for non-cursor pagination patterns you should always prefer pages_ahead/pages_ahead_unordered streams.
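For intuition, a rough hand-written equivalent of this bounded prefetching could look like the sketch below (again, just an illustration of the technique, not this crate's actual code; Client, Page, fetch_page, process_page and PAGES_AHEAD are made-up placeholders):

use futures::stream::{FuturesOrdered, StreamExt};

const PAGES_AHEAD: usize = 4; // plays the role of requests_ahead_count

async fn run(client: &Client, total_pages: usize) {
    let mut in_flight = FuturesOrdered::new();
    let mut next_to_schedule = 0;

    // Seed the first PAGES_AHEAD requests up front.
    while next_to_schedule < PAGES_AHEAD.min(total_pages) {
        in_flight.push_back(fetch_page(client, next_to_schedule));
        next_to_schedule += 1;
    }

    // Each time a page arrives, immediately schedule one more request, so up to
    // PAGES_AHEAD fetches stay in flight while the pages are being processed.
    while let Some(page) = in_flight.next().await {
        if next_to_schedule < total_pages {
            in_flight.push_back(fetch_page(client, next_to_schedule));
            next_to_schedule += 1;
        }
        process_page(page);
    }
}

// Hypothetical helper that fetches a single page by index.
async fn fetch_page(client: &Client, page_index: usize) -> Page {
    client.get_page(page_index).await
}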

> PS. This is a nice project! There's definitely use for this kinda crate.

Thank you! I'm already at the point where I cannot imagine how to work with AWS/Azure APIs or scrape sites without the help of this crate. It just reduces all the mess into a bunch of combinators or into a simple while loop :)

a1akris added the enhancement label Oct 26, 2024
a1akris mentioned this issue Oct 27, 2024 (7 tasks)
a1akris added the sched:December and next labels Oct 27, 2024
StarlessNights commented:

Thanks for the explanation @a1akris.

> Yes, certainly, and pages_ahead will remain more optimal even compared to the next implementation of pages. Unlike pages, pages_ahead fetches the first requests_ahead_count pages in parallel, and as soon as it receives any page it immediately starts fetching the next one. In other words, while you're processing fetched items there are always requests_ahead_count fetch tasks running in the background, which makes the delays a lot smoother.

The one thing I'm a bit worried about with this is memory usage. If the processing of elements is slower than the page fetching and the iterated collection is huge, I'll end up with an unbounded number of pages in memory? If this is correct then I'd feel better about having an argument for throttling the page fetching. E.g. something like ready_pages_limit.

a1akris (Owner, Author) commented Oct 29, 2024

Good question! The argument you're referring to already exists: requests_ahead_count. I mentioned that fetching happens in the background in parallel, which is true because pages_ahead_* streams schedule multiple futures that query the corresponding pages at the same time. However, the scheduling itself does not happen in the background! The scheduling logic triggers only when the .next() method is called, so asymmetries between processing and fetching speeds don't matter!

Example

let mut items = std::pin::pin!(client.pages_ahead(4, req, Limit::None));


// Here the stream schedules 4 (requests_ahead_count) futures,
// and as soon as the first one completes it schedules the 5th request
// and returns you the first page to process.
let page1 = items.next().await;

// While we're here, no extra futures can be scheduled in the background, 
// but futures that have already been scheduled are executing.
process_page(page1); // sleep for 300 seconds

// The 2nd request has certainly completed within those 300 seconds, so you
// get the second page immediately, but before yielding it the stream
// schedules the 6th request.
let page2 = items.next().await;

// Get 3rd page, schedule 7th request.
let page3 = items.next().await;

// and so on...
let page4 = process_page(items.next().await);

TL;DR

If turn_page allocates M bytes of memory, then pages_ahead_* streams allocate (N + 1) * M + C bytes of memory, where N is requests_ahead_count and C is the size of the scheduler state, which is allocated only once upon stream creation.
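To put rough numbers on it: with requests_ahead_count = 4 and pages of about 1 MiB each, that's on the order of 5 MiB of page buffers plus the fixed scheduler state.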

StarlessNights commented:

Thanks again for the detailed explanation! Ya that makes sense. Perhaps a similar note about memory usage could be added to the docs too?
