filter away empty fields/subfields after input #165

Open
kaplun opened this issue Jul 13, 2016 · 6 comments

@kaplun
Member

kaplun commented Jul 13, 2016

Problem

Currently, utils.filter_values() filters keys and their corresponding values out of dictionaries only when the value is None.

Concretely, e.g. in the context of MARC21 conversion to JSON, this means that subfields containing empty strings are preserved, and so are datafields with no subfields.

Proposal

If we assume that an empty string in the bibliographic metadata context doesn't carry any valuable information, it is proposed that filter_values actually filters away any key whose value:

  • evaluates to False,
  • unless it is the False value itself (thus representing a flag set to false) or the number 0.
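
The rule proposed above could be sketched roughly as follows. This is only an illustration: `carries_information` and `filter_falsy_values` are hypothetical names, not part of dojson, and the sketch works on a plain dict rather than on the real filter_values decorator.

```python
def carries_information(value):
    """Return True if the value should be kept under the proposed rule."""
    # False and 0 are falsy but meaningful (a flag set to false, a zero count),
    # so booleans and numbers are always kept.
    if isinstance(value, (bool, int, float)):
        return True
    # Everything else is kept only if it is truthy: None, "", [] and {} go away.
    return bool(value)


def filter_falsy_values(mapping):
    """Return a copy of the mapping without the keys whose value is empty."""
    return {key: value for key, value in mapping.items()
            if carries_information(value)}


subfields = {"a": "Foo", "b": "", "c": None, "d": False, "e": 0}
print(filter_falsy_values(subfields))
# {'a': 'Foo', 'd': False, 'e': 0}
```

Note that this only looks at the top level of the dictionary; nested empty containers would need a recursive pass.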

Use cases

According to TIND, @Kennethhole reports:

I can confirm that TIND does not intend to use empty fields. However, it is highly likely that there are empty subfields in our databases and we prefer that dojson don't break due to that! From our point of view, these subfields can be removed during the conversion.

As for INSPIRE, I can confirm that we have no use for empty values. Internally we went further and implemented a function that recursively visits the whole record and also strips away the empty lists and empty dicts that result from filtering values.
https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/utils/__init__.py#L206
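
For readers who do not want to follow the link, a recursive cleanup of this kind might look like the sketch below. It is not the INSPIRE code: `strip_empty` and `_is_empty` are hypothetical names, and the choice of what counts as empty (None, empty strings, empty lists, tuples and dicts) is an assumption.

```python
def _is_empty(value):
    """None and zero-length strings/containers count as empty; False and 0 do not."""
    if value is None:
        return True
    return isinstance(value, (str, list, tuple, dict)) and len(value) == 0


def strip_empty(obj):
    """Recursively drop empty values, then drop containers that became empty."""
    if isinstance(obj, dict):
        cleaned = {key: strip_empty(value) for key, value in obj.items()}
        return {key: value for key, value in cleaned.items()
                if not _is_empty(value)}
    if isinstance(obj, (list, tuple)):
        cleaned = [strip_empty(value) for value in obj]
        # Preserve the original sequence type (list or tuple).
        return type(obj)(value for value in cleaned if not _is_empty(value))
    return obj


record = {
    "title": {"a": "Foo", "b": ""},
    "notes": [{"a": ""}, {}],
    "deleted": False,
    "count": 0,
}
print(strip_empty(record))
# {'title': {'a': 'Foo'}, 'deleted': False, 'count': 0}
```

Children are cleaned before their parent is checked, so a list that contains only empty dicts disappears along with them.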

See also:

@tiborsimko
Member

... in other words, do we want to support MARC21 records containing "empty fields" such as:

<datafield tag="123" ind1="4" ind2="5">
</datafield>

and "empty subfields" such as:

<datafield tag="123" ind1="4" ind2="5">
  <subfield code="a">Foo</subfield>
   <subfield code="b"></subfield>
</datafield>

or do we want to always remove these empty fields/subfields?
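
To make the "always remove" option concrete, here is a minimal sketch using only the Python standard library. It is not how dojson processes MARCXML; `remove_empty_fields` is a hypothetical helper, the sample has no XML namespace (as in the snippets above), and treating whitespace-only subfields as empty is an assumption.

```python
import xml.etree.ElementTree as ET


def remove_empty_fields(record):
    """Drop empty subfields, then datafields left without any subfield."""
    for datafield in list(record.findall("datafield")):
        for subfield in list(datafield.findall("subfield")):
            # Assumption: a subfield with no text or only whitespace is empty.
            if not (subfield.text or "").strip():
                datafield.remove(subfield)
        if datafield.find("subfield") is None:
            record.remove(datafield)
    return record


MARCXML = """<record>
  <datafield tag="123" ind1="4" ind2="5">
  </datafield>
  <datafield tag="123" ind1="4" ind2="5">
    <subfield code="a">Foo</subfield>
    <subfield code="b"></subfield>
  </datafield>
</record>"""

record = remove_empty_fields(ET.fromstring(MARCXML))
print(ET.tostring(record, encoding="unicode"))
```

After the call, only the second datafield survives, with its single "a" subfield.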

CC @aw-bib @martinkoehler @fjorba @jma @basaglia

CC @inveniosoftware/triagers

@aw-bib

aw-bib commented Jul 14, 2016

Just cross-checked with our librarians to be sure not to miss esoteric cases:

  • Empty fields are not valid and should be removed
  • Empty subfields are not valid and should be removed

As for TIND's comment: our librarians confirmed that e.g. Aleph allows empty fields/subfields to be loaded on ingestion of external data (i.e. the equivalent of bibupload on the shell). However, Aleph's bibedit silently and automatically removes any of these fields once a cataloguer opens and stores such a record. That is, even if you deliberately add an empty field/subfield in Aleph's bibedit, you cannot save it. Thus, you cannot rely on an empty field being preserved in this commercial system: as soon as a cataloguer touches such a record, these fields get stripped. (IMHO Aleph is at least inconsistent here, with a tendency to strip.)

@kaplun
Member Author

kaplun commented Jul 19, 2016

OK. Given the above and:

@jirikuncar
Member

Then we should have a specific filter_values decorator just for MARC21. Or simply add a new filter for the command line that removes empty values.
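
A decorator along these lines might look like the sketch below. It is an assumption about the shape only: `filter_empty_values` and the `title` rule are hypothetical, the rule signature is simplified, and nothing here is taken from the dojson code base.

```python
import functools


def filter_empty_values(f):
    """Hypothetical decorator: drop None and empty-string values from f's result."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        result = f(*args, **kwargs)
        # False and 0 are kept because they are neither None nor "".
        return {key: value for key, value in result.items()
                if value is not None and value != ""}
    return wrapper


@filter_empty_values
def title(key, value):
    """Toy conversion rule for a title-like field (illustrative only)."""
    return {
        "title": value.get("a"),
        "subtitle": value.get("b"),
        "medium": value.get("h"),
    }


print(title("245__", {"a": "Foo", "b": ""}))
# {'title': 'Foo'}
```

Keeping this separate from the generic filter_values would let non-MARC21 rules keep their current behavior.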

@kaplun
Member Author

kaplun commented Jul 19, 2016

Such as the general one we are using in INSPIRE? https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/utils/__init__.py#L245

@tiborsimko
Member

Yes, I think we can close this RFC to say that empty values in fields/subfields should be "tolerated" on the input upload side, but that we can delete them internally as soon as we spot them.

@tiborsimko tiborsimko changed the title from "RFC: default filter_values() behavior" to "filter away empty fields/subfields after input" Aug 1, 2016
@kaplun kaplun removed their assignment Sep 29, 2016
@tiborsimko tiborsimko modified the milestones: v1.4.0, v1.3.0 Feb 14, 2017