-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON column containing object or array of objects gets quoted as a string - breaks parsing downstream #221
Comments
I have the same issue from Rust, using the tokio-postgres crate. For However, that is not the case for arrays! Arrays are stored in some other format, where a single value (without quotes in it) looks like this:
whereas others look like this:
This inconsistency is annoying, and makes custom parsing harder. Prior to an official resolution of this issue, does anyone know the name of the format used above, so I can find a parser for it? (I'd rather not rely on my own attempt to escape quotes and such properly) |
As I don't know the name of the format, I made a quick attempt at writing a parser for it in Rust. I am very new to Rust, so the only way I was able to get it to compile was by unrolling the Anyway, here it is: use super::type_aliases::JSONValue;
#[cfg(test)]
mod tests {
use crate::utils::postgres_parsing::parse_postgres_array_as_strings;
#[test]
fn simple() {
let simple_source = "{example}";
let simple_result = parse_postgres_array_as_strings(simple_source);
assert_eq!(simple_result, vec!["example"]);
}
#[test]
fn escaped() {
let escaped_source = r#"{"example \"text\" with quotes in it",123123}"#;
let escaped_result = parse_postgres_array_as_strings(escaped_source);
assert_eq!(escaped_result, vec![r#"example "text" with quotes in it"#, "123123"]);
}
}
/// See: https://github.com/eulerto/wal2json/issues/221#issuecomment-1025143441
/// View the tests above for examples, and intended functionality.
pub fn parse_postgres_array(array_str: &str) -> JSONValue {
let result_as_strings: Vec<String> = parse_postgres_array_as_strings(array_str);
let result_as_json_value_strings = result_as_strings.into_iter().map(|a| serde_json::Value::String(a)).collect();
let result_within_json_array = serde_json::Value::Array(result_as_json_value_strings);
result_within_json_array
}
pub fn parse_postgres_array_as_strings(array_str: &str) -> Vec<String> {
let chars_struct = array_str.chars();
let chars = chars_struct.collect::<Vec<char>>();
let mut result_as_strings: Vec<String> = vec![];
let mut in_quote = false;
let mut in_entry = false;
let mut last_char_was_escape_backslash = false;
//let mut current_entry_str: Option<String> = None;
let mut current_entry_str: String = String::new(); // empty means none
/*let mut end_current_entry = || {
result_as_strings.push(current_entry_str.unwrap());
current_entry_str = None;
in_quote = false;
in_entry = false;
};*/
//for (i, ch) in chars.enumerate() {
//let chars_length = chars.into_iter().count();
let chars_length = chars.len();
let mut i = 0;
for ch in chars {
match last_char_was_escape_backslash {
true => {
last_char_was_escape_backslash = false;
//current_entry_str.unwrap().push(ch);
current_entry_str.push(ch);
}
false => {
match ch {
'{' if i == 0 => {},
'}' if i == chars_length - 1 => {
//if current_entry_str.is_some() {
if current_entry_str.len() > 0 {
//end_current_entry();
{
/*result_as_strings.push(current_entry_str.unwrap());
current_entry_str = None;*/
result_as_strings.push(current_entry_str);
current_entry_str = String::new();
in_quote = false;
in_entry = false;
}
}
},
'\\' => {
last_char_was_escape_backslash = true;
},
'"' => {
in_quote = !in_quote;
// if just left a quote
if !in_quote {
//end_current_entry();
{
result_as_strings.push(current_entry_str);
current_entry_str = String::new();
in_quote = false;
in_entry = false;
}
}
},
// ie. if just left a quote
',' if !in_entry => {},
// if hit a separator after a non-quoted entry
',' if in_entry && !in_quote => {
//end_current_entry();
{
result_as_strings.push(current_entry_str);
current_entry_str = String::new();
in_quote = false;
in_entry = false;
}
},
_ => {
// if hit start of entry
//if current_entry_str.is_none() {
if current_entry_str.len() == 0 {
//current_entry_str = Some(String::new());
current_entry_str = String::new();
in_entry = true;
}
current_entry_str.push(ch);
}
};
},
};
i += 1;
}
result_as_strings
} It seems to work for the basic cases I've tried, but I think it's likely there are cases it fails on. That's part of why it'd be nice to have wal2json emit valid objects/arrays for jsonb fields -- or at least a valid json string, so callers can parse the data without writing brittle parse functions like the above. |
…unctionality! (it was more straightforward than I expected) * Made XLogData message parsing more robust. (old version broke in a case where a single "{" char showed up in the binary portion) * Added parsing of the weird structure that wal2json uses for representing arrays. (see: eulerto/wal2json#221)
I tried the following example: CREATE TABLE xyz (
a integer,
b hstore,
c jsonb,
primary key(a)
);
INSERT INTO xyz (a, b, c) VALUES(1, 'a=>x, b=>y'::hstore, '{"name":"account_no_history","value":"[{"acc":"3597","createdby":"Viztek,Pacs","createddt":"2015-09-24T17:14:30.728Z"}]"}'::jsonb); This example also covers issue #222 . In format v1, I get:
In format v2, I get:
What is your version? |
We have the same issue with parsing stringified JSON values which makes is a little bit more complex to handle our format before the export. Is there maybe also a plan to support native native JSON objects for JSON column types? |
Using the v1 format on postgres 15, the output from a jsonb column is correct (though stringified) for me: create table test(word text, doc jsonb);
commit;
insert into test values ('abc', '[1, 2, null, "abc"]');
commit; Results in this output {"change":[{"kind":"insert","schema":"public","table":"test","columnnames":["word","doc"],"columntypes":["text","jsonb"],"columnvalues":["abc","[1, 2, null, \"abc\"]"]}]} I can run it through $ cat /tmp/wal2json.sample | jq -r '.change[0].columnvalues[1]' | jq
[
1,
2,
null,
"abc"
] I used this dockerfile to test: FROM postgres:15.5-bullseye
RUN apt-get update && apt-get install -y postgresql-15-wal2json
CMD ["postgres", "-c", "wal_level=logical"] It'd definitely be nice to get jsonb fields "natively" parsed, but I wonder if this ticket should be a feature request instead of a bug? Edit: Also tested with |
This issue is talking about JSON columns, not JSONB |
Ah sorry about that. Misread the issue then |
Slack thread: https://meltano.slack.com/archives/C06A1MD6A6L/p1724619858505739 When using log based replication, the wal2json output for columns of array types returns a string encoded in sql format. Ex: '{a,b}' The records produced by the tap in this sutation fails the target schema validation, since the schema is of type array and the value is a string. I included a test case for it. Please feel free to modify the PR, I am by no means a python developer. Related: - eulerto/wal2json#221 (comment) --------- Co-authored-by: Edgar Ramírez Mondragón <[email protected]>
Updating a table with JSON columns will treat the row data as a standard string, adding erroneous double-quotes around the value, instead of returning as an array/object.
Example result (from
.columns
):This throws when attempting to parse (tried
jq
and Node)The text was updated successfully, but these errors were encountered: