Help deserialize mixed tags and string in body $value (html text formatting) #257

Open
Rudo2204 opened this issue Feb 4, 2021 · 20 comments
Labels
enhancement serde Issues related to mapping from Rust types to XML

Comments

@Rudo2204

Rudo2204 commented Feb 4, 2021

I'm trying to deserialize some dictionary definitions and came across this one, which mixes several tags with plain text (HTML text formatting).

<div style="margin-left:2em"><b>1</b> 〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《<i>pl</i>. quizzes》.</div>

I looked through the serde-xml-rs tests and tried this solution, which seems close but doesn't quite work:

#[derive(Debug, Deserialize, PartialEq)]
struct DivDefinition {
    style: String,
    #[serde(rename = "$value")]
    definition: Vec<MyEnum>,
}

#[derive(Debug, Deserialize, PartialEq)]
enum MyEnum {
    b(String),
    #[serde(rename = "$value")]
    String,
    i(String),
}

The error I'm getting is:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom("unknown variant `〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《`, expected one of `b`, `$value`, `i`")'

I can make it work for now by dropping MyEnum and just using definition: Vec<String>, but then I wouldn't know which text is bold and which is italic.
How can I properly deserialize this?

@dralley
Collaborator

dralley commented Jul 26, 2023

Whoever picks this up, consider starting from #511

@lkirkwood

Has anybody found a workaround for this? I am having the same issue.

@lkirkwood

You can close this. Don't know when it was fixed but the original example works now with minor edits:

#[derive(Debug, Deserialize, PartialEq)]
struct DivDefinition {
    #[serde(rename = "@style")]
    style: String,
    #[serde(rename = "$value")]
    definition: Vec<MyEnum>,
}

#[derive(Debug, Deserialize, PartialEq)]
enum MyEnum {
    b(String),
    #[serde(rename = "$text")]
    String,
    i(String),
}
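
For readers landing here: once the content sits in a Vec of such an enum, telling bold from italic is a plain match. The sketch below is std-only and hand-builds the value instead of deserializing it, so the type `Inline`, its variant names, and `sample_definition` are illustrative assumptions, not quick-xml API. It also assumes the text variant carries a payload (like the `String(String)` variant in a later comment); the unit variant `String` above would presumably not keep the text itself.

```rust
/// Hand-written stand-in for the deserialized enum (hypothetical names).
/// With serde the variants would be renamed to `b`, `i` and `$text`.
#[derive(Debug, PartialEq)]
enum Inline {
    Bold(String),
    Italic(String),
    Text(String),
}

/// Renders the parts with Markdown-style markers, to show that the
/// bold/italic distinction survives in the data.
fn to_markdown(parts: &[Inline]) -> String {
    let mut out = String::new();
    for part in parts {
        match part {
            Inline::Bold(s) => {
                out.push_str("**");
                out.push_str(s);
                out.push_str("**");
            }
            Inline::Italic(s) => {
                out.push('*');
                out.push_str(s);
                out.push('*');
            }
            Inline::Text(s) => out.push_str(s),
        }
    }
    out
}

/// A shortened, hand-built version of the dictionary entry from the
/// first post; this is an assumption about the shape, not parser output.
fn sample_definition() -> Vec<Inline> {
    vec![
        Inline::Bold("1".into()),
        Inline::Text(" an examination; a quiz 《".into()),
        Inline::Italic("pl".into()),
        Inline::Text(". quizzes》.".into()),
    ]
}
```

Calling `to_markdown(&sample_definition())` walks the parts in document order, which is the property the original question depends on.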

@enricozb

enricozb commented Jan 4, 2024

Thoughts on this idea? enricozb@7b4b3f8

Specifically, I'm adding a new special field name $raw that can only deserialize into a String, and just writes all events, until the expected end event, into a string.

It lets you do stuff like this:

use serde::Deserialize;

const XML: &str = r#"
  <who-cares>
    <foo property="value">
      test
      <bar><bii/><int>1</int></bar>
      test
      <baz/>
    </foo>
  </who-cares>
"#;

#[derive(Deserialize, Debug)]
struct Root {
  // `$raw` is the special field name proposed in the linked commit
  #[serde(rename = "$raw")]
  value: String,
}

fn main() {
  let root = quick_xml::de::from_str::<Root>(XML).unwrap();

  println!("parsed: {root:?}");
}

This prints

parsed: Root { value: "<foo property=\"value\">test<bar><bii></bii><int>1</int></bar>test<baz></baz></foo>" }

One of the problems with this approach is that it doesn't save exactly what was in the XML file. Saving the exact input would be ideal: as with serde_json::value::RawValue, we could likely avoid allocations, preserve formatting, and keep whitespace untrimmed.

Another issue is that empty tags <bii/> get converted to <bii></bii> as that is how the events come in.

It's possible my initial idea could be fixed up to temporarily disable the reader's trimming while raw_string is in use.

@Mingun
Collaborator

Mingun commented Jan 4, 2024

Deserialization of RawValue in serde_json is implemented as deserialization of a newtype with a special name:
https://github.com/serde-rs/json/blob/0131ac68212e8094bd14ee618587d731b4f9a68b/src/de.rs#L1711-L1724

The deserializer then returns data from its own buffer or directly from the input string, depending on which type is deserialized (Box<RawValue> or &RawValue). We can do the same because we have read_text, but right now only for a borrowing reader. We need to implement #483 in order to implement the read_text_into needed for an owned reader.

@enricozb

enricozb commented Jan 4, 2024

Got it. I saw that private newtype name, but wasn't sure why it mattered. I see now that the json deserializer looks for this tag. I'll take a stab at this.

@enricozb

enricozb commented Jan 4, 2024

Additionally, I'm not sure if we should capture the surrounding tags or not. What should this print:

struct AnyName {
  root: RawValue,
}

const xml: &str = "
  <root>
    <some/><inner/><tags/>
  </root>
";

let x: AnyName = from_str(xml)?;

println!("{}", x.root);

Should this print

<root>
  <some/><inner/><tags/>
</root>

or

<some/><inner/><tags/>
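
To make the two options concrete, here is a std-only sketch that slices both candidates straight out of the input. `raw_slices` is a hypothetical helper built on naive string search, not quick-xml's reader: it assumes the element name does not recur inside itself, that no other tag name starts with the same prefix, and that no attribute value contains `>`. Because both results borrow from the input, whitespace and empty tags such as `<some/>` come back exactly as written.

```rust
/// Returns `(outer, inner)` for the first `<name ...>...</name>` element
/// found in `xml`. Both slices borrow from `xml`; nothing is allocated
/// for the results and nothing is trimmed.
///
/// Hypothetical helper with naive matching; see the assumptions above.
fn raw_slices<'a>(xml: &'a str, name: &str) -> Option<(&'a str, &'a str)> {
    let open = format!("<{name}");
    let close = format!("</{name}>");

    // Start of the opening tag.
    let start = xml.find(&open)?;
    // First byte after the `>` that ends the opening tag.
    let inner_start = start + xml[start..].find('>')? + 1;
    // Start of the closing tag (first match, so no same-name nesting).
    let inner_end = inner_start + xml[inner_start..].find(&close)?;
    let outer_end = inner_end + close.len();

    Some((&xml[start..outer_end], &xml[inner_start..inner_end]))
}
```

With the document from the comment above, `outer` includes the `<root>` tags and `inner` is everything between them, indentation included. Either one could back a RawValue-like type; the sketch only shows that both are cheap to produce when the input is a borrowed string.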

@NuSkooler

NuSkooler commented Jul 23, 2024

Hi, I'm trying to track down a way to de-serialize unknown/arbitrary data under a specific tag and found my way here. Is this currently possible in any form?

I have something like this:

<root>
  <someTag> <!-- I am only aware of this tag -->
    <arbitraryTag1>
      <arbitraryTag2>...stuff...</arbitraryTag2>
      <anotherArbitraryTag>foo</anotherArbitraryTag>
    </arbitraryTag1>
  </someTag>
</root>

I simply need everything under someTag as a HashMap<String,String> ideally.

@Mingun
Collaborator

Mingun commented Jul 23, 2024

If ...stuff... and foo contain only textual data, CDATA sections, comments (which would be skipped), and processing instructions (also skipped), then I think it should be possible today. If they can contain markup (i.e. nested tags), then you cannot read them into a String.

@NuSkooler

NuSkooler commented Jul 23, 2024

@Mingun Thanks for the quick reply! I updated my example, it was missing some data.

Basically, under someTag, there is a nested structure starting with arbitraryTag1, but always key-value tags from there. I'd like to capture the name of arbitraryTag1 in some way, and HashMap<String, String> for the key-values.

@Mingun
Collaborator

Mingun commented Jul 23, 2024

So in your example you expect HashMap with

  • key: arbitraryTag1
  • value:
        <arbitraryTag2>...stuff...</arbitraryTag2>
        <anotherArbitraryTag>foo</anotherArbitraryTag>

?
Or you need something like

// type of `someTag` field
struct SomeTagType {
  // filled with "arbitraryTag1"
  name: String,

  // filled with
  // - ("arbitraryTag2", "...stuff...")
  // - ("anotherArbitraryTag", "foo")
  // - ...
  fields: HashMap<String, String>,
}

?

Both are impossible right now. The first because we cannot capture markup into a String, the second because we (probably) cannot capture a tag name as a value (there is a separate issue for that: #778).
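
Outside serde, the second shape is simple enough to fill by hand when the input really is one wrapper element around flat key-value tags. The following is a minimal std-only sketch under exactly that assumption; `parse_some_tag` and `open_tag` are hypothetical helpers, not part of quick-xml, and they deliberately reject attributes, nested markup, and self-closing tags. Entities, comments, and CDATA are not handled.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct SomeTagType {
    /// Name of the wrapper element, e.g. "arbitraryTag1".
    name: String,
    /// Child element name -> text content.
    fields: HashMap<String, String>,
}

/// Reads one attribute-free start tag at the front of `s` (after
/// optional whitespace) and returns `(tag_name, rest_of_input)`.
fn open_tag(s: &str) -> Option<(&str, &str)> {
    let s = s.trim_start().strip_prefix('<')?;
    let end = s.find('>')?;
    let name = &s[..end];
    // Reject end tags, self-closing tags, and tags with attributes.
    if name.is_empty()
        || name.starts_with('/')
        || name.ends_with('/')
        || name.contains(char::is_whitespace)
    {
        return None;
    }
    Some((name, &s[end + 1..]))
}

/// Parses `<wrapper><k1>v1</k1><k2>v2</k2>...</wrapper>`.
/// Returns `None` for anything outside that narrow shape.
fn parse_some_tag(s: &str) -> Option<SomeTagType> {
    let (name, mut rest) = open_tag(s)?;
    let outer_close = format!("</{name}>");
    let mut fields = HashMap::new();
    loop {
        let trimmed = rest.trim_start();
        if let Some(after) = trimmed.strip_prefix(outer_close.as_str()) {
            // Only whitespace may follow the wrapper's end tag.
            if !after.trim().is_empty() {
                return None;
            }
            return Some(SomeTagType { name: name.to_string(), fields });
        }
        let (key, after_open) = open_tag(trimmed)?;
        let close = format!("</{key}>");
        let end = after_open.find(&close)?;
        let value = &after_open[..end];
        if value.contains('<') {
            return None; // nested markup is out of scope for this sketch
        }
        fields.insert(key.to_string(), value.to_string());
        rest = &after_open[end + close.len()..];
    }
}
```

Feeding it the contents of `<someTag>` from the example yields `name == "arbitraryTag1"` and two entries in `fields`. For anything beyond this narrow shape, a real XML reader is the safer choice.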

@NuSkooler

NuSkooler commented Jul 23, 2024

@Mingun thanks, the 2nd example is what I'm after.

Can you think of any workarounds?

@NuSkooler

@Mingun Apologies for the "bump", I'm trying to determine where this stands exactly. #778 mentions something works, but I can't find it.

Ideally, I'm after the ability to capture arbitrary nested XML, similar to what a HashMap<String, serde_json::Value> can achieve with JSON (in fact, I need to turn it into JSON afterwards).

I'm not 100% clear if this is the correct ticket, #778, or something else.

Thanks again!

@Mingun
Collaborator

Mingun commented Jul 25, 2024

In #383, @alex-semov's initial post gives code that looks like what you need. Try experimenting with it. If you don't have to extract the attributes from <arbitraryTag1>, then it looks like it works.

@NuSkooler

In #383 @alex-semov in the initial post gave a code that looks like what you need. Try experimenting with it. If you don't have to extract the attributes from <arbitraryTag1>, then it looks like it works.

Unfortunately we need to extract/convert arbitrary XML into a JSON representation in our case. Something like:

<xml>
  <foo><bar>123</bar></foo>
  <foobar someattr="thing"/>
  <bazfoo anotherattr="stuff">bazzle</bazfoo>
</xml>

to

{
  "foo": {
    "bar": 123
  },
  "foobar": {
    "@someattr": "thing"
  },
  "bazfoo": {
    "@anotherattr": "stuff",
    "@value": "bazzle"
  }
}

The JSON structure is just an example; we just need some way to do it.

@eirnym

eirnym commented Sep 24, 2024

The best way is to represent some kind of DOM structure. This would give an option to manipulate XML as is and would bring standardization with other languages, which is a huge plus.

@Mingun
Collaborator

Mingun commented Sep 24, 2024

I already noted in some related issues that I have a very WIP dom branch in my repository. Feel free to finish it; I do not think I will work on it in the near future.

@Ray-Eldath

Ray-Eldath commented Oct 8, 2024

Do we have any recommended workaround for this issue? From the discussion I conclude that the enum approach is the only thing that works. I think $raw is pretty good, but it doesn't seem to have been merged.

@Ray-Eldath

Ray-Eldath commented Oct 9, 2024

@Mingun @enricozb sorry to bother, but I'm very interested in using read_text or similar to achieve $raw in the current version. Is there any way to use quick-xml reader abilities (like read_text) in a serde deserializer? Something like

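use serde::{Deserialize, Deserializer};

fn raw_de<'de, D>(deserializer: D) -> Result<String, D::Error>
where
    D: Deserializer<'de>,
{
    // `quick_xml::de::RawElement` is hypothetical: it is the kind of type being asked for
    let element: quick_xml::de::RawElement = Deserialize::deserialize(deserializer)?;
    Ok(element.read_text())
}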

would be of great value and flexibility (is this what you refer to as a DOM?).


To parse an XML document like

<title>
text <sub>1-<i>y</i></sub>
</title>

I currently have to mimic an HTML-like representation with many structs and a custom deserializer, like:

#[derive(Deserialize, Debug)]
#[serde(rename_all(deserialize = "snake_case"))]
enum ItalicBoldString {
    Sup,
    Sub,
    I(String),
    B(ItalicBoldStringWrapper),
    #[serde(rename = "$text")]
    String(String),
}
#[derive(Deserialize, Debug)]
struct ItalicBoldStringWrapper {
    #[serde(rename(deserialize = "$value"), default)]
    field: Vec<ItalicBoldString>,
}
#[derive(Deserialize, Debug)]
#[serde(rename_all(deserialize = "snake_case"))]
enum CouldBeString {
    Sup(ItalicBoldStringWrapper),
    Sub(ItalicBoldStringWrapper),
    I(ItalicBoldStringWrapper),
    B(ItalicBoldStringWrapper),
    Math,
    #[serde(rename = "$text")]
    String(String),
}
#[derive(Deserialize, Debug)]
struct SegmentedString {
    #[serde(rename(deserialize = "$value"), default)]
    field: Vec<CouldBeString>,
}

fn traverse_ibs_wrapper(ibs: &ItalicBoldStringWrapper) -> String {
    ibs.field
        .iter()
        .map(|e| match &e {
            ItalicBoldString::I(str) => str.clone(),
            ItalicBoldString::B(str) => format!("{:?}", str),
            ItalicBoldString::String(str) => str.clone(),
            _ => "".to_string(),
        })
        .collect()
}

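// The enclosing function header was missing from the post. This signature is an
// assumption modeled on `raw_de` above; the name `segmented_de` is hypothetical.
fn segmented_de<'de, D>(deserializer: D) -> Result<String, D::Error>
where
    D: serde::Deserializer<'de>,
{
// Ok(SegmentedString::deserialize(deserializer)?.field.join(" "))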
Ok(SegmentedString::deserialize(deserializer)?
    .field
    .iter()
    .map(|e| match e {
        CouldBeString::I(str) => traverse_ibs_wrapper(&str),
        CouldBeString::B(str) => traverse_ibs_wrapper(&str),
        CouldBeString::Sup(str) => traverse_ibs_wrapper(&str),
        CouldBeString::Sub(str) => traverse_ibs_wrapper(&str),
        CouldBeString::String(str) => str.clone(),
        &CouldBeString::Math => "".to_string(),
    })
    .map(|e| e.trim().to_string())
    .collect::<Vec<_>>()
    .join(" "))
}

All of this is needed just to extract the raw or plain text from an XML node, which is very tedious and ad hoc (it can only be applied to a small subset of all possible combinations of the HTML tree). If I use something like enum CouldBeString { B(Box<CouldBeString>) }, it overflows the stack. I wish there were some way to stop the stack overflow, or a $raw or $plain_text, to save us from this.

@Mingun
Collaborator

Mingun commented Oct 9, 2024

No, currently there is no way to do that. If you want this feature, consider contributing to the implementation. I think something like what serde_json does should be implemented.

if I use something like enum CouldBeString { B(Box<CouldBeString>) } then it would stackoverflow.

That is #819, which I accidentally discovered a couple of hours ago. Use a struct variant with a $value field as a workaround.
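
A sketch of what that shape can look like, together with the plain-text flattening asked for above. It is std-only: the `Node` type is hand-built and the serde attributes appear only as comments, so whether quick-xml deserializes this exact definition without hitting #819 is taken from the comment above, not verified here. `Node`, `plain_text`, and `sample_title` are illustrative names.

```rust
/// Recursive mixed-content tree. Every element is a struct variant whose
/// single field holds the children, which is the suggested workaround
/// shape. With serde, each `children` field would carry
/// `#[serde(rename = "$value")]` and `Text` would carry
/// `#[serde(rename = "$text")]`.
#[allow(dead_code)] // not every variant is constructed in this sketch
#[derive(Debug)]
enum Node {
    Sub { children: Vec<Node> },
    Sup { children: Vec<Node> },
    I { children: Vec<Node> },
    B { children: Vec<Node> },
    Text(String),
}

/// Concatenates all text in document order, ignoring the markup.
/// Recursion depth equals the nesting depth of the input.
fn plain_text(nodes: &[Node]) -> String {
    let mut out = String::new();
    for node in nodes {
        match node {
            Node::Text(s) => out.push_str(s),
            Node::Sub { children }
            | Node::Sup { children }
            | Node::I { children }
            | Node::B { children } => out.push_str(&plain_text(children)),
        }
    }
    out
}

/// Hand-built tree for `text <sub>1-<i>y</i></sub>`; an assumption about
/// the shape deserialization would produce, not actual parser output.
fn sample_title() -> Vec<Node> {
    vec![
        Node::Text("text ".to_string()),
        Node::Sub {
            children: vec![
                Node::Text("1-".to_string()),
                Node::I {
                    children: vec![Node::Text("y".to_string())],
                },
            ],
        },
    ]
}
```

Because `Vec<Node>` already provides the indirection, no `Box` is needed, and one recursive function replaces the two hand-written traversal layers from the earlier comment.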
