Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently we use Arrow as part of our public API in Lance. RecordBatchReader is extremely useful. However, there are times we would like an asynchronous version. There is datafusion's RecordBatchStream, and we have our own equivalent in lancedb (also called RecordBatchStream, for better or worse). We have our own because we don't want to make datafusion part of our public API, which keeps the API simpler. As a result, when passing data between these APIs we do a lot of conversion from arrow's error to datafusion's error to lancedb's error.
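To illustrate the conversion chain, here is a minimal sketch using stand-in error types. The names `ArrowLikeError`, `EngineLikeError`, and `AppLikeError` are hypothetical placeholders, not the real arrow, datafusion, or lancedb definitions. It only shows how each layer re-wraps the error from the layer below.

```rust
// Stand-in error types; hypothetical, not the real crate definitions.
#[derive(Debug)]
pub struct ArrowLikeError(pub String);

#[derive(Debug)]
pub enum EngineLikeError {
    Arrow(ArrowLikeError),
}

#[derive(Debug)]
pub enum AppLikeError {
    Engine(EngineLikeError),
}

impl From<ArrowLikeError> for EngineLikeError {
    fn from(e: ArrowLikeError) -> Self {
        EngineLikeError::Arrow(e)
    }
}

impl From<EngineLikeError> for AppLikeError {
    fn from(e: EngineLikeError) -> Self {
        AppLikeError::Engine(e)
    }
}

// Lowest layer: fails with the Arrow-level error.
fn read_batch() -> Result<(), ArrowLikeError> {
    Err(ArrowLikeError("schema mismatch".to_string()))
}

// Middle layer: `?` converts the Arrow-level error into the engine's error.
fn engine_step() -> Result<(), EngineLikeError> {
    read_batch()?;
    Ok(())
}

// Public API layer: `?` converts again, into the application's error.
pub fn app_step() -> Result<(), AppLikeError> {
    engine_step()?;
    Ok(())
}
```

With a shared stream trait whose item type uses arrow's own error, the middle wrapping step could in principle be skipped at API boundaries.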
I'm mainly opening this issue in the interest of discussion, to see if this is something we'd be willing to add. If so, I can put together a proposal PR.
Describe the solution you'd like
// Pretty much identical to datafusion's `RecordBatchStream` except using arrow's `Result`
pub trait RecordBatchStream: Stream<Item = Result<RecordBatch>> {
    fn schema(&self) -> Arc<Schema>;
}
Describe alternatives you've considered
As far as I can tell the biggest drawback would be the introduction of futures as a dependency. This could be feature-gated. Alternatively, we could vendor the Stream trait.
I'm not sure how I feel about that but I don't think futures is going to be absorbed into std anytime soon. We could even still have a futures feature that provides an impl futures::Stream for RecordBatchStream.
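As a rough sketch of what vendoring a Stream trait might look like, the following uses only std. The vendored `Stream` copies the shape of `poll_next` from futures. `Schema`, `RecordBatch`, and `ArrowError` here are placeholder structs standing in for arrow's types, and `VecBatchStream`, `drain_ready`, and `example_stream` are hypothetical helpers invented for the illustration. This is not a proposed implementation.

```rust
use std::collections::VecDeque;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Vendored stand-in with the same `poll_next` shape as `futures::Stream`.
pub trait Stream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

// Hypothetical placeholders for arrow's `Schema`, `RecordBatch`, and `ArrowError`.
#[derive(Debug, PartialEq)]
pub struct Schema(pub Vec<String>);
#[derive(Debug, PartialEq)]
pub struct RecordBatch {
    pub num_rows: usize,
}
#[derive(Debug, PartialEq)]
pub struct ArrowError(pub String);

// The trait from the proposal, built on the vendored `Stream`.
pub trait RecordBatchStream: Stream<Item = Result<RecordBatch, ArrowError>> {
    fn schema(&self) -> Arc<Schema>;
}

// In-memory stream whose items are always immediately ready.
pub struct VecBatchStream {
    schema: Arc<Schema>,
    batches: VecDeque<Result<RecordBatch, ArrowError>>,
}

impl Stream for VecBatchStream {
    type Item = Result<RecordBatch, ArrowError>;
    fn poll_next(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        Poll::Ready(self.batches.pop_front())
    }
}

impl RecordBatchStream for VecBatchStream {
    fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema)
    }
}

// Waker that does nothing; enough to poll a stream that never returns Pending.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Collects items until the stream ends or reports Pending for the first time.
pub fn drain_ready<S>(stream: &mut S) -> Vec<Result<RecordBatch, ArrowError>>
where
    S: RecordBatchStream + Unpin,
{
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut out = Vec::new();
    while let Poll::Ready(Some(item)) = Pin::new(&mut *stream).poll_next(&mut cx) {
        out.push(item);
    }
    out
}

// Builds a two-item stream: one batch, then one error.
pub fn example_stream() -> VecBatchStream {
    VecBatchStream {
        schema: Arc::new(Schema(vec!["id".to_string()])),
        batches: VecDeque::from(vec![
            Ok(RecordBatch { num_rows: 3 }),
            Err(ArrowError("decode failed".to_string())),
        ]),
    }
}
```

The trade-off is that a vendored trait gets none of the combinators from futures for free, so an optional feature that bridges to `futures::Stream` would likely still be wanted by most callers.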
Additional context
I think the need for this is pretty well demonstrated elsewhere in the ecosystem, given the use in both Lance and DataFusion, and I'm sure other places.