Is your feature request related to a problem or challenge? Please describe what you are trying to do.
JSON is notoriously semi-structured, but as far as I can tell arrow-json's ReaderBuilder only allows very strict parsing of JSON rows against a specific homogeneous schema. There is an option to ignore unwanted columns (with_strict_mode(false)) and one to coerce boolean/numeric values to strings (with_coerce_primitive(true)), but any other kind of schema mismatch on even a single row produces a hard error for the whole batch.
Describe the solution you'd like
It would be nice to have some ReaderBuilder option that converted wrong-type values to NULL instead of forcing a parsing error. For example, Spark's (woefully underdocumented) from_json function does this by default.
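For illustration, such an option might look like this (with_null_on_incompatible is a hypothetical name, not an existing arrow-json API):

```rust
// Hypothetical sketch -- `with_null_on_incompatible` does not exist in
// arrow-json; the name and semantics here are purely illustrative.
let mut reader = ReaderBuilder::new(schema)
    .with_null_on_incompatible(true) // proposed: mismatched values become NULL
    .build(cursor)?;

// With the option enabled, a row like {"a": [2, 3]} read against an Int64
// field "a" would decode as NULL in column "a" rather than failing the batch.
```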
NOTE: This is NOT a request to handle malformed JSON. If the input can't even be parsed as JSON, that should remain an error. This is about handling the case where e.g. the schema requests an int and the JSON provides an array of ints.
Schema disagreements like this frequently come up due to versioning issues or disagreements between different parties exchanging data in JSON format; in such cases, getting partial data is arguably better than getting nothing at all.
Is this something that could be considered desirable to support?
Describe alternatives you've considered
One alternative might be to request a schema with all-string leaf fields and manually parse values afterward. Spark's from_json has such an option. But that doesn't handle the case where the data provides an array or object where the schema expected a leaf, and it also doesn't handle the case where the schema expected a non-leaf column like a struct and got an array or primitive instead.
Additional context
Some examples where it would be helpful to tolerate partially incompatible schemas:
delta-io/delta#2419
delta-io/delta-kernel-rs#501
delta-io/delta-kernel-rs#712