**Describe the bug**
This bug is related to #17058.

There are two scenarios in which the partial table constructed from a batch (i.e., a byte range across sources) can be empty:

**When the last batch is an incomplete row:** The following test fails with the error `Mismatch in JSON schema across batches in multi-source multi-batch JSON reader`:
```cpp
TEST_F(JsonReaderTest, Debug)
{
  std::string data = R"(
{"a": "b"}
{"a": "b"}
{"a": "b"}
{"a": "b"}
)";
  // Force a batch size slightly smaller than the source so that the last
  // batch's byte range starts inside the final (already consumed) row.
  setenv("LIBCUDF_JSON_BATCH_SIZE", std::to_string(data.size() - 5).c_str(), 1);
  auto opts =
    cudf::io::json_reader_options::builder(cudf::io::source_info{data.data(), data.size()})
      .lines(true)
      .build();
  auto res = cudf::io::read_json(opts);
  unsetenv("LIBCUDF_JSON_BATCH_SIZE");
}
```
Since the penultimate batch already consumes the last row of the source, the byte range assigned to the last batch contains no record starts and hence yields an empty table.
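To make the arithmetic concrete, here is a small standalone check of the byte ranges in the test above. It is not taken from the reader's code; it just assumes the usual byte-range rule that a record is owned by the range containing its first byte:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
  // Source layout: a leading '\n' plus four 11-byte lines ('{"a": "b"}' + '\n').
  std::vector<int> const record_starts{1, 12, 23, 34};
  int const source_size = 45;               // 1 + 4 * 11
  int const batch_size  = source_size - 5;  // LIBCUDF_JSON_BATCH_SIZE = 40

  // Assume a record is parsed by the batch whose byte range contains its first byte.
  for (int offset = 0; offset < source_size; offset += batch_size) {
    int rows = 0;
    for (int start : record_starts) {
      if (start >= offset && start < offset + batch_size) { ++rows; }
    }
    std::printf("batch [%d, %d): %d rows\n",
                offset, std::min(offset + batch_size, source_size), rows);
  }
  // Prints: batch [0, 40): 4 rows
  //         batch [40, 45): 0 rows  <- the empty partial table
  return 0;
}
```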
**When a batch contains a single JSON line of size > `INT_MAX`:** In this case too, the partial table would be empty. However, the `INT_MAX` size constraint on the input buffer passed to the tokenizer prevents this error from surfacing for now. A related corner case is documented in #17058 ([BUG] Limit size of buffer read by batched multi-source JSON lines reader to be at most `INT_MAX` bytes), where the batch contains multiple lines, the last of which is incomplete, resulting in a buffer larger than 2 GB.
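For completeness, one way this scenario could fail loudly instead of producing an empty table is a size guard on the tokenizer input. This is only a sketch; the function and parameter names below are placeholders, not the reader's actual internals:

```cpp
#include <cudf/utilities/error.hpp>

#include <cstdint>
#include <limits>

// Hedged sketch: reject a batch whose tokenizer input would exceed INT_MAX
// bytes instead of silently yielding an empty partial table. `buffer_size`
// stands in for the size of the buffer handed to the tokenizer.
void check_tokenizer_input_size(std::size_t buffer_size)
{
  CUDF_EXPECTS(
    buffer_size <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()),
    "A single JSON line in the batch exceeds the INT_MAX tokenizer limit");
}
```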
**Expected behavior**

For the scenarios listed above, the reader should either:
- skip the empty table while concatenating partial tables from batches (a sketch of this option follows below), or
- error out with a `CUDF_EXPECTS`.
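A minimal sketch of the first option, assuming a hypothetical `partial_tables` collection in the batched reader (this is not the actual cudf implementation):

```cpp
#include <cudf/concatenate.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/error.hpp>

#include <memory>
#include <vector>

// Hedged sketch: drop empty partial tables before concatenation so that an
// empty byte range does not trip the schema-match check across batches.
std::unique_ptr<cudf::table> concatenate_non_empty(
  std::vector<std::unique_ptr<cudf::table>> const& partial_tables)
{
  std::vector<cudf::table_view> views;
  for (auto const& tbl : partial_tables) {
    if (tbl->num_rows() > 0) { views.push_back(tbl->view()); }
  }
  CUDF_EXPECTS(!views.empty(), "All batches produced empty tables");
  return cudf::concatenate(views);
}
```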