[BUG] Batched multi-source JSON reader does not require each batch to contain at least one JSON line #17836

Open
shrshi opened this issue Jan 28, 2025 · 0 comments · May be fixed by #17837
Labels
bug Something isn't working

shrshi commented Jan 28, 2025

Describe the bug
This bug is related to #17058.
There are two scenarios in which the partial table constructed from a batch (i.e., a byte range across sources) can be empty:

  1. When the last batch consists only of an incomplete row: the following test fails with the error "Mismatch in JSON schema across batches in multi-source multi-batch JSON reader".
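TEST_F(JsonReaderTest, Debug)
{
  // Four complete JSON lines, followed by a newline and trailing whitespace.
  std::string data = R"(
  {"a": "b"}
  {"a": "b"}
  {"a": "b"}
  {"a": "b"}
  )";
  // Make the batch size 5 bytes smaller than the input, so that a second (last)
  // batch covers only the final 5 bytes, in which no row starts.
  setenv("LIBCUDF_JSON_BATCH_SIZE", std::to_string(data.size() - 5).c_str(), 1);
  auto opts =
    cudf::io::json_reader_options::builder(cudf::io::source_info{data.data(), data.size()})
      .lines(true)
      .build();
  // Currently fails here with the schema mismatch error quoted above.
  auto res = cudf::io::read_json(opts);
  unsetenv("LIBCUDF_JSON_BATCH_SIZE");
}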

Because the penultimate batch already reads through the last row of the source, the byte range left for the last batch contains no rows, so the reader constructs an empty table for it.
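The effect can be illustrated with a small standalone model. This is only a sketch, not libcudf's actual batching logic: `rows_per_batch` is a hypothetical helper, and it assumes that a row belongs to the batch in which its line starts and that whitespace-only lines produce no row.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of libcudf): for each fixed-size batch, count
// the non-blank lines whose first byte falls inside that batch's byte range.
std::vector<int> rows_per_batch(std::string const& data, std::size_t batch_size)
{
  assert(batch_size > 0);
  std::size_t const num_batches = (data.size() + batch_size - 1) / batch_size;
  std::vector<int> counts(num_batches, 0);

  std::size_t line_start = 0;
  while (line_start < data.size()) {
    std::size_t line_end = data.find('\n', line_start);
    if (line_end == std::string::npos) { line_end = data.size(); }

    // A line yields a row only if it has a non-whitespace character
    // before its terminating newline (or the end of the input).
    bool const non_blank = data.find_first_not_of(" \t\r", line_start) < line_end;
    if (non_blank) { counts[line_start / batch_size] += 1; }

    line_start = line_end + 1;
  }
  return counts;
}
```

Under these assumptions, a 55-byte input shaped like the one in the test (a leading newline, four indented rows, two trailing spaces) with a batch size of 50 gives the counts {4, 0}: every row starts in the first batch, and the last batch is left with no rows.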

  2. When a batch contains a single JSON line larger than INT_MAX bytes: here too, the partial table would be empty. However, the INT_MAX size constraint on the input buffer passed to the tokenizer currently prevents this error from surfacing. A related corner case is documented in [BUG] Limit size of buffer read by batched multi-source JSON lines reader to be at most INT_MAX bytes #17058, where the batch contains multiple lines, the last of which is incomplete, and the resulting buffer exceeds 2 GB.

Expected behavior for the scenarios listed above

  1. Skip the empty table while concatenating partial tables from batches
  2. Error out with a CUDF_EXPECTS
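
A minimal sketch of the two expected behaviors, written against stand-in types rather than libcudf: `PartialTable`, `concatenate_skipping_empty`, and `check_line_size` are hypothetical names, an "empty table" is assumed to mean zero rows, and standard exceptions stand in for the `CUDF_EXPECTS` check an actual fix would use.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a partial table produced by one batch (not a cudf type).
struct PartialTable {
  std::vector<std::string> column_names;
  std::size_t num_rows;
};

// Behavior 1: skip empty partial tables, so that only non-empty tables take
// part in the schema comparison and the concatenation.
PartialTable concatenate_skipping_empty(std::vector<PartialTable> const& partials)
{
  PartialTable result{{}, 0};
  bool have_schema = false;
  for (auto const& p : partials) {
    if (p.num_rows == 0) { continue; }  // empty batch: nothing to concatenate
    if (!have_schema) {
      result.column_names = p.column_names;
      have_schema         = true;
    } else if (p.column_names != result.column_names) {
      throw std::runtime_error("Mismatch in JSON schema across batches");
    }
    result.num_rows += p.num_rows;
  }
  assert(have_schema || result.num_rows == 0);
  return result;
}

// Behavior 2: reject a single JSON line that exceeds the tokenizer's limit.
void check_line_size(std::size_t line_bytes,
                     std::size_t max_bytes = static_cast<std::size_t>(INT_MAX))
{
  if (line_bytes > max_bytes) {
    throw std::length_error("A single JSON line exceeds the maximum buffer size");
  }
}
```

Skipping on row count rather than comparing schemas first means a trailing empty batch, as in scenario 1, no longer reaches the schema check at all.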
@shrshi shrshi added the bug Something isn't working label Jan 28, 2025
@shrshi shrshi linked a pull request Jan 28, 2025 that will close this issue