LINDT units of measure #129

VladimirAlexiev opened this issue Nov 3, 2020 · 67 comments

Comments

@VladimirAlexiev
Contributor

VladimirAlexiev commented Nov 3, 2020

Why?

It's hard to work with quantities (value + UoM) in RDF and SPARQL.

There are about 10 UoM ontologies:

  • the weaker ones are just lists of units
  • the better ones also add dimensional analysis (eg length is L^1, area is L^2) and conversion factors (eg from cm to m, from degF to degC)

Working with units in SPARQL is quite hard. Comparing compatible units or doing arithmetic on quantities is possible, but difficult, and only if you are working with one of the better ontologies. You have to fetch the dimension vectors and conversion factors and work with them, and the queries become very complex.

SHACL's modest arithmetic capabilities (eg minInclusive to compare to constant, lessThan to compare two props) borrow from SPARQL, so it's impossible to state "temperature should be between 0 and 10 degC", see https://lists.w3.org/Archives/Public/public-shacl/2020Nov/0001.html

But there is one approach that solves these problems.

Previous work

  • The Unified Code for Units of Measure (UCUM) http://unitsofmeasure.org/ucum.html codifies all kinds of units and guarantees unambiguous interpretation (no unit label conflicts).
  • The Java UCUM library implements this system
  • Linked Data Types (LINDT) uses this and SPARQL datatype handlers to implement it in SPARQL

LINDT is unique in that it encodes both value and unit in one literal, eg "1 m"^^cdt:ucum, "100 cm"^^cdt:ucum. This is economical, but more importantly you can compare such quantities, and you can also do arithmetic operations on quantities.
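
As a rough illustration of why a combined value-and-unit literal makes comparison possible, here is a minimal Python sketch. It is not LINDT's implementation: the `TO_METRE` table and both function names are hypothetical, and a real datatype handler would delegate to a UCUM library rather than a hand-written table.

```python
from fractions import Fraction

# Hypothetical conversion table: unit code -> factor to metres.
# A real datatype handler would ask a UCUM library instead.
TO_METRE = {"m": Fraction(1), "cm": Fraction(1, 100), "km": Fraction(1000)}


def parse_length(lexical: str) -> Fraction:
    """Parse a lexical form such as '100 cm' and return the length in metres."""
    value, unit = lexical.split(maxsplit=1)
    return Fraction(value) * TO_METRE[unit]


def same_length(a: str, b: str) -> bool:
    """Compare two length literals by value rather than by string."""
    return parse_length(a) == parse_length(b)
```

With this table, `same_length("1 m", "100 cm")` holds even though the two lexical forms differ, which is the behaviour an overloaded `=` operator would give in SPARQL.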

This would be very useful for any sort of application in engineering, smart cities, semantic sensor networks, WoT, etc.

  • @maximelefrancois86 tells me it even supports complex numbers, which are important in some electricity applications, is that correct? Can you give an example?

Features https://ci.mines-stetienne.fr/lindt/v2/custom_datatypes.html#on-apache-jena

  • Overload of SPARQL operators (=, <, etc.) to compare measurement literals;
  • Overload of algebraic functions (+, -, *, /) to manipulate measurement literals:
    • Add two commensurable measurement literals
    • Subtract a measurement literal from a commensurable one
    • Multiply two measurement literals, or a measurement literal and a scalar (xsd:int, xsd:decimal, xsd:float, xsd:double)
    • Divide a measurement literal by a measurement literal, a measurement literal by a scalar, or a scalar by a measurement literal
  • Custom SPARQL function lindt:sameDimension(arg1,arg2) to check if two measurement literals are commensurable (returns a xsd:boolean).
  • Cast to XSD numeric datatypes
  • dynamic loading of new datatypes/units

LINDT is very ingenious and it's a pity that it hasn't found a wider following.

  • It's implemented as a Jena branch but hasn't been merged into trunk: "This branch is 14 commits ahead, 5325 commits behind apache:master."
  • We have been looking for a pretext (i.e. client) to implement it in rdf4j.
  • It's adopted in some ontologies, but these are very few:
    • CoCoOn: Cloud Computing Ontology for IaaS Price and Performance Comparison
    • VSSo: Vehicle Signal and Attribute Ontology
    • others?

Proposed solution

Adopt LINDT as a best practice for representing units.
Work with other communities (WoT, semantic sensors) to also adopt it.

Considerations for backward compatibility

No direct consequences because it uses custom datatype handlers to do its work. I.e. if you don't use the CDT datatypes (cdt:ucum, cdt:length, etc) you'll see no difference.

However, guidance and solution templates for migrating from other systems for representing units should be provided

@JervenBolleman
Collaborator

JervenBolleman commented Nov 3, 2020

I have been thinking about this exact thing. However, I was thinking that all of UCUM might be too much work to demand, and it is currently awkward licensing-wise (section 7 of the license). So I was wondering if the 7 base units of the International System of Units plus the 21 coherent derived named units might be a sufficient lower bound for implementation.

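| unit | proposed datatype | example of idea |
|---|---|---|
| second | unit:s | 60^^unit:s |
| meter | unit:m | 1.99^^unit:m |
| kilogram | unit:kg | 88^^unit:kg |
| ampere | unit:A | |
| kelvin | unit:K | 273.1^^unit:K |
| mole | unit:mol | |
| candela | unit:cd | |
| hertz | unit:Hz | |
| radian | unit:rad | |
| steradian | unit:sr | |
| newton | unit:N | |
| pascal | unit:Pa | |
| joule | unit:J | |
| watt | unit:W | |
| coulomb | unit:C | |
| volt | unit:V | |
| farad | unit:F | |
| ohm | unit:Ω | |
| siemens | unit:S | |
| weber | unit:Wb | |
| tesla | unit:T | |
| henry | unit:H | |
| lumen | unit:lm | |
| lux | unit:lx | |
| becquerel | unit:Bq | |
| gray | unit:Gy | |
| sievert | unit:Sv | |
| katal | unit:kat | |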

The implementation advantage is that these are simple numeric values: as long as the datatypes are the same, cast to decimal and compare.

There are no prefixes, all must be converted to the base unit. e.g. 600km should be stored as "600000"^^unit:m.
Derived units must always be in coherent form for this to work (i.e. also not have prefixes).
Advantage of scaling down to base units is simpler comparison functions, and easier to generate indexes.

The list misses Celsius, which is trivial to add, but would need to be comparable to Kelvin.
Also ohm symbol Ω is outside of ascii so unit:Ohm could be an option.

Basically, the cost is in converting to base units at storage. However, consistent storage makes it easier to build indexes and ensure query results are correct.

This of course leaves out all the derived units, e.g. kg/m, m/s^2, etc., which I think would be very worthwhile to have, but it might be too much work to implement all commonly used ones. Unless they follow a straightforward pattern in IRI encoding that can be decoded and generated by stores on the fly, e.g. division "1"^^unit:kg / "1"^^unit:m => "1"^^unit:kg-per-m and multiplication "2"^^unit:m * "2"^^unit:m => "4"^^unit:m2
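
The two ideas above (scale to the base unit at storage time, and derive the result datatype name by a fixed pattern) could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the `Stored` type, the `PREFIX` table and the function names are hypothetical, and only the `-per-` and squared-unit naming patterns come from the examples in this comment.

```python
from decimal import Decimal
from typing import NamedTuple


class Stored(NamedTuple):
    value: Decimal
    datatype: str  # e.g. "unit:m"


# Hypothetical prefix table; only a few SI prefixes are listed.
PREFIX = {"k": Decimal(1000), "c": Decimal("0.01"), "m": Decimal("0.001")}


def to_base(value: str, prefix: str, base: str) -> Stored:
    """Scale a prefixed value to its base unit before storing it."""
    return Stored(Decimal(value) * PREFIX[prefix], "unit:" + base)


def local(datatype: str) -> str:
    """Return the local name of a datatype, e.g. 'kg' for 'unit:kg'."""
    return datatype.split(":", 1)[1]


def divide(a: Stored, b: Stored) -> Stored:
    """Divide two quantities; the result datatype follows the '-per-' pattern."""
    name = f"unit:{local(a.datatype)}-per-{local(b.datatype)}"
    return Stored(a.value / b.value, name)


def square(a: Stored, b: Stored) -> Stored:
    """Multiply two quantities of the same unit, giving e.g. unit:m2."""
    if a.datatype != b.datatype:
        raise NotImplementedError("only the same-unit case is sketched here")
    return Stored(a.value * b.value, a.datatype + "2")
```

A store following this pattern would only need the naming rule, not a catalogue of every derived unit; how products of different units should be named is left open here, as it is in the comment.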

Also, the use of datatypes avoids one of the issues with UCUM: conflicts in coding exist, and the coding of customary units is not straightforward. See Fluid Ounce, where a datatype pointing to the fluid ounce definition would be a clearer option.
Specifically, there are several redefinitions of the fluid ounce for legal reasons. It's 30 ml or 23 1/3 grams of pure alcohol in some US food standards (TODO: find again an article showing the many redefinitions of US fluid ounce). UCUM is widely specified in clinical settings but in reality not always used (even where it was specified).

Side notes

  • all unit values should inherit from xsd:decimal.
  • we should have power and square root operators

@VladimirAlexiev
Contributor Author

@JervenBolleman what licensing problems do you see?

UCUM has customary units that are important in many disciplines. If you ask users to always store angstroms and light-years as meters, you are shifting burden to them.

And even "decorative" units like 123 {rbc} which is a count (dimensionless) but of "red blood cells".

LINDT is implemented in Jena, and @jeenbroekstra says it won't be too hard to port to rdf4j. So "downgrading" to your proposal will be more work for Java devs.

So that leaves other languages. The question is whether there are UCUM libraries in other languages.

@HolgerKnublauch

HolgerKnublauch commented Nov 3, 2020

I much prefer using explicit datatypes to encoding the unit into the string. E.g. "1"^^unit:m is better than "1 m"^^ucum:unit.

But a more general solution might be to offer a declarative extension point so that anyone can define custom datatypes and those datatypes actually can be used consistently. This could work similar to user-defined SHACL constraint components or SHACL-AF functions. Just some very quick thoughts, a datatype might need to be able to respond to questions "can I compare my value to another datatype" (e.g. yes for mm to m comparison), and then a normalize function that would bring all datatypes from a group to a common base unit, e.g. meter. Then things like < comparison in SPARQL can be automated. The actual business logic can probably be covered declaratively through a couple of properties that are attached to the units as done (comprehensively) in the QUDT vocabulary.

The advantage here is that a SPARQL 1.2 would only need to implement a few generic building blocks while the details of the specific datatypes are irrelevant, and we don't even need to discuss the specific catalog of datatypes that need to be implemented.

@VladimirAlexiev
Contributor Author

Connecting UCUM unit symbols to ontologies like QUDT or OM is important and useful because they expose as triples info that is within the UCUM library (eg the dimension vector of Newton and the conversion factors of Fahrenheit). Plus extra info, eg that Inch is an imperial unit, grouping of units by discipline, etc. I believe QUDT or OM already has UCUM codes, so that should not be hard.

So we could spec custom functions to parse out a unit from a quantity, and connect to a structured unit node in such an ontology.

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Nov 5, 2020

@HolgerKnublauch

"1"^^unit:m is better than "1 m"^^ucum:unit

In addition to the generic cdt:ucum, LINDT also has datatypes cdt:length, cdt:mass etc that represent quantities with fixed/known dimension.

But I see some problems with having distinct datatypes for each unit:

  • there are just too many. It's nearly a combinatorial explosion. Eg "barrels per day", "US barrels per hour", etc etc.
    • Consider that just for dimensionless units, there are many variations such as percent, per mille, ppm (parts per million...).
    • There are also annotations (advisory customary pieces), eg {rbc} (red blood cells), {pair} or {pairs} for socks, {packs} vs {masterboxes} for cigarettes, s {0..100 km/h} for car acceleration expressed as time to reach that speed, etc: see https://ucum.org/ucum.html#para-6
  • units use special symbols that will be unwieldy in URL local names or will become unreadable if you URL-encode them. Eg what datatype URLs would you translate the following units to? (they happen to express the same unit):
"km.h-1"^^cdt:ucumunit
"km/h"^^cdt:ucumunit
"(1000m)/(60min)"^^cdt:ucumunit

offer a declarative extension point

LINDT does that:


BTW @maximelefrancois86 there are Broken links at https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes:

In contrast, both of https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes and https://ci.mines-stetienne.fr/lindt/v3/custom_datatypes exist.

@namedgraph

"(1000m)/(60min)"^^cdt:ucumunit -- this is simply not a structured way of describing units, which goes against the RDF practice.

@VladimirAlexiev
Contributor Author

@namedgraph This unit is well structured and well defined according to https://ucum.org/ucum.html (and implemented in Java UCUM and consequently in LINDT).
It's just not structured in RDF.

NOT everything needs or should be structured in RDF. Eg are you against GeoSPARQL literals (WKT and GML)?

@maximelefrancois86

Dear all,

@namedgraph, there is sometimes a good rationale for encoding complex values using literals instead of relying on RDF structures and basic datatypes. The OGC GeoSPARQL datatype geo:WKTLiteral is a great example.

"<http://www.opengis.net/def/crs/OGC/1.3/CRS84> Polygon((-83.6 34.1, -83.6 34.5, -83.2 34.5, -83.2 34.1, -83.6 34.1))"^^geo:WKTLiteral

I back up the thoughts of @VladimirAlexiev :

  • I think having a single datatype cdt:ucum would be the simplest choice. SPARQL engines would only need to recognise one additional datatype IRI, and could defer to the UCUM specification for the list of base units and for how compound units can be formed. There exist implementations of UCUM in common programming languages.
  • In our implementation on apache Jena, we also included:
    • overload of SPARQL operators (=, <, etc.) to compare measurement literals;
    • overload of algebraic functions (+, -, *, /) to manipulate measurement literals;
    • a custom SPARQL function with IRI: http://w3id.org/lindt/custom_datatypes#sameDimension(arg1, arg2) to check if two measurement literals are commensurable (returns a xsd:boolean).
    • cast to XSD numeric datatypes

@namedgraph

OK fine, WKT literals is a counter-example. But GeoSPARQL is an additional standard, not part of the SPARQL spec. And you seem to have done the same with units. So what's the problem? Why does it need to be in SPARQL 1.2 proper?

@maximelefrancois86

I am neutral on this. I also think it would be fine to have such a datatype specified in a separate document.

@VladimirAlexiev
Contributor Author

I don't see this becoming part of SPARQL 1.2. As I said "Adopt LINDT as a best practice" and work with other communities to adopt it.

@maximelefrancois86

  • can you give an example of complex numbers used for electrical quantities?
  • see "Broken links" above

@maximelefrancois86

Thank you @VladimirAlexiev , we are on the same page.

For complex numbers: we just had a very good first-year Master's student who worked on this during her three-month internship this year: Yana Soares de Paula. https://www.linkedin.com/in/yanaspaula/ She did an excellent job in just three months, but more work would be needed to augment cdt:ucum with complex numbers. She would probably be happy to share her report with you if you wish.

About the broken links in lindt v1, I'll create an issue and check asap. Thanks for the notice.

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Nov 5, 2020

There exist implementations of UCUM in common programming languages

Eg there seem to be 2 for JS:

It appears UCUM is the dominant UoM system in life sciences.

@dr-shorthair

dr-shorthair commented Nov 6, 2020

Connecting UCUM unit symbols to ontologies like QUDT

@VladimirAlexiev Most QUDT units now have UCUM codes in their description, so correlation of these two systems is already available. (I did this work in the last few months.) The ones that are missing do not have equivalent UCUM codes, so there is nothing on that side to correlate with.

QUDT gives you explicit dimension-vectors, and conversion factors (and offsets, where appropriate).

I'm also on the UCUM Advisory Board, and the licensing issue is high on the agenda.
Though the current UCUM Terms of Use look a bit fierce at first glance, I have been assured that the kind of usage that is envisaged here is totally fine, and the intention is to make this more clear in the license.

@dr-shorthair

dr-shorthair commented Nov 6, 2020

On the matter of style: I vote with @HolgerKnublauch in favour of

273^^ucum:K

compared with

"273 K"^^cdt:ucum

It does not require a string to be parsed, so basic SPARQL queries can be used, detecting the datatype, but without regexing strings.

@dr-shorthair

dr-shorthair commented Nov 6, 2020

@JervenBolleman I think your table matches the UCUM codes, except for Ohm for Ω (note case). That is no surprise as UCUM was designed to use the common codes as far as possible.

This XML representation is the reference for the UCUM terminals.

@sa-bpelakh

This does seem like the best way to adhere to standard, industry-wide usage while avoiding mixing domain-specific aspects into the generic SPARQL standard: this should be an auxiliary standard, like GeoSPARQL. The complex cdt:ucum literals are quite similar to WKT in this respect.

@dr-shorthair

Scaling factors for scalar quantities are not domain specific at all.

In fact I'd argue it is a notable failure of almost all computer languages that this is not built-in.
There are very few pure 'floating point' numbers, or 'decimals' that can be understood without knowing the unit-of-measure.

I'm totally fine with embedding coordinate sequences in a microformat, since they have no meaning considered independently. I was in the team that standardized GeoSPARQL and am very comfortable with the design choice. But scalar quantities are a very different matter, and much more simple.

@HolgerKnublauch

I also think that units are a different topic than GeoSPARQL. Users should expect to perform comparisons using the built-in < and > operators, and possibly to do arithmetics such as + and * on unit'ed (is this a word?) values. This might of course just become a matter of enough implementations agreeing on a de-facto standard, but it shouldn't be too hard to agree on a mechanism at least for the most common units in a SPARQL 1.2. Once it's in SPARQL then related standards such as SHACL would automatically "inherit" these features, e.g. for sh:minInclusive.

@JervenBolleman
Collaborator

@dr-shorthair I expanded my comment, changed to unit:Ohm, whose casing is inconsistent over standards.

@HolgerKnublauch I also think easier support by stores for custom datatypes would be very nice. And would make implementing this feature cheaper for everyone. Let's open a separate issue for easier custom datatypes. (Also easier sharing of custom function definitions).

@VladimirAlexiev I would love full UCUM support for some projects I am involved in. I am just worried that it would be too large a code base for independent smaller SPARQL communities to implement. Also I think we end up with a downstream licensing issue with UCUM until their license is changed, which might take a long time.

@kasei
Collaborator

kasei commented Nov 6, 2020

@JervenBolleman I could see standardization of service description vocabulary terms for describing which custom datatypes are supported. Beyond that, though, wouldn't "easier custom datatypes" be an issue for individual implementations (and not something the spec can/should concern itself with)? What would spec involvement in this area look like?

@VladimirAlexiev
Contributor Author

@dr-shorthair where do you have mapping tables QUDT-UCUM showing in particular the gaps on either side?

As I wrote above, it's useful to have in RDF (QUDT) what UCUM libraries provide in code.

Please comment on how you would represent the variety of UCUM strings (including annotations in curlies) as datatype URLs. You picked the easiest case K.

@sa-bpelakh Agreed! As I said above, this can only be a recommended best practice, can't be part of the SPARQL spec.

@HolgerKnublauch

perform comparisons using the built-in < and > operators, and possibly to do arithmetics such as + and * on unit'ed (is this a word?)

For comparison, + and - you need Commensurate quantities (having same dimensionality).
You can apply * and / to any quantities, and also between quantities and simple numbers.

LINDT does all that.
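
To make the commensurability rule concrete, here is a small Python sketch using dimension vectors. It is an illustration under stated assumptions, not how LINDT or the UCUM Java library is written: the `Quantity` class is hypothetical and tracks only three dimensions (length, mass, time), with values already expressed in coherent SI units.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Quantity:
    value: Fraction  # magnitude in coherent SI units
    dims: tuple      # exponents of (length, mass, time)

    def _check(self, other):
        # Comparison, + and - are only defined for the same dimension vector.
        if self.dims != other.dims:
            raise TypeError("quantities are not commensurable")

    def __add__(self, other):
        self._check(other)
        return Quantity(self.value + other.value, self.dims)

    def __sub__(self, other):
        self._check(other)
        return Quantity(self.value - other.value, self.dims)

    def __lt__(self, other):
        self._check(other)
        return self.value < other.value

    def __mul__(self, other):
        # Multiplying quantities adds exponents; a plain number only scales.
        if isinstance(other, Quantity):
            dims = tuple(a + b for a, b in zip(self.dims, other.dims))
            return Quantity(self.value * other.value, dims)
        return Quantity(self.value * Fraction(other), self.dims)

    def __truediv__(self, other):
        # Dividing quantities subtracts exponents; a plain number only scales.
        if isinstance(other, Quantity):
            dims = tuple(a - b for a, b in zip(self.dims, other.dims))
            return Quantity(self.value / other.value, dims)
        return Quantity(self.value / Fraction(other), self.dims)
```

Dividing a length by a time yields the dimension vector `(1, 0, -1)` (a speed), whereas adding a length to a time raises `TypeError`.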

sh:minInclusive

Yes! And sh:lessThan

@JervenBolleman

too large a code base for independent smaller SPARQL communities to implement

UCUM has implementations in many languages, they should leverage such implementations.
LINDT uses UCUM Java and hooks up into Jena datatype handlers to override SPARQL operators.

@kasei
Collaborator

kasei commented Nov 6, 2020

too large a code base for independent smaller SPARQL communities to implement

UCUM has implementations in many languages, they should leverage such implementations.
LINDT uses UCUM Java and hooks up into Jena datatype handlers to override SPARQL operators.

As an implementor of several SPARQL systems in less popular languages, I join @JervenBolleman in concern at the implementation burden. Just because something like this has implementations in several languages does not mean there wouldn't be a real cost added to many existing (and possibly future!) systems.

@ashleysommer

ashleysommer commented Nov 6, 2020

I'm a maintainer of the Python RDFLib (including its SPARQL executor) and developer of PySHACL.

I agree with @VladimirAlexiev on this one. After reading the UCUM Spec I don't see how individual 273^^ucum:K could work for all of the possible combinations of units of measurement allowed by UCUM.

You'd need a string representation like "273 K"^^cdt:ucum, a simple example is 10 millimeters of mercury (for pressure measurement) "10 mm[Hg]"^^cdt:ucum and for a more extreme example "ventricular stroke work" in "gramforce-meter per heartbeat per square meter" "4 gf.m/({hb}.m2)"^^cdt:ucum.
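
A short Python sketch of what handling such a lexical form involves at the simplest level: separating the number from the unit string and pulling out the curly-brace annotations. The regular expressions and function names are hypothetical, and the unit string is not validated against the UCUM grammar (a real implementation would hand it to a UCUM library).

```python
import re
from decimal import Decimal

# A number, whitespace, then everything else as the (unvalidated) unit string.
LEXICAL = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+(.+?)\s*$")
# UCUM annotations are written in curly braces, e.g. {hb} or {rbc}.
ANNOTATION = re.compile(r"\{([^{}]*)\}")


def split_literal(lexical: str):
    """Split a lexical form like '10 mm[Hg]' into (Decimal value, unit string)."""
    match = LEXICAL.match(lexical)
    if match is None:
        raise ValueError(f"not a 'value unit' lexical form: {lexical!r}")
    return Decimal(match.group(1)), match.group(2)


def annotations(unit: str):
    """Return the annotation texts found in a unit string."""
    return ANNOTATION.findall(unit)
```

For the stroke-work example, `split_literal` returns the value `4` and the unit `gf.m/({hb}.m2)`, and `annotations` on that unit returns `["hb"]`.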

While the set of units defined in UCUM is closed, the microformat is created in such a way that adding new units (in a subsequent version) is easy and predictable. If every unit in the current spec was pulled out into a discrete datatype in the ucum ontology, that would need to be updated whenever a new unit is added to UCUM.

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Nov 6, 2020

Here is the offending license clause:

Subject to Section 1 and the other restrictions hereof, users may incorporate portions of the UCUM table and definitions into another master term dictionary (e.g. laboratory test definition database), or software program for distribution outside of the user's corporation or organization, provided that any such master term dictionary or software program includes the following fields reproduced in their entirety from the UCUM table: UCUM code, definition value and unit. Every copy of the UCUM table incorporated into or distributed in conjunction with another database or software program must include the following notice:

“This product includes all or a portion of the UCUM table, UCUM codes, and UCUM definitions or is derived from it, subject to a license from Regenstrief Institute, Inc. and The UCUM Organization. Your use of the UCUM table, UCUM codes, UCUM definitions also is subject to this license, a copy of which is available at http://unisofmeasure.org. The current complete UCUM table, UCUM Specification are available for download at http://unitsofmeasure.org. The UCUM table and UCUM codes are copyright © 1995-2013, Regenstrief Institute, Inc. and the Unified Codes for Units of Measures (UCUM) Organization. All rights reserved.

THE UCUM TABLE (IN ALL FORMATS), UCUM DEFINITIONS, AND SPECIFICATION ARE PROVIDED "AS IS." ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.”

If the master term dictionary or software program containing the UCUM table, UCUM definitions and/or UCUM specification is distributed with a printed license, this statement must appear in the printed license. Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed on a fixed storage medium, a text file containing this information also must be stored on the storage medium in a file called "UCUM_short_license.txt". Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed via the Internet, this information must be accessible on the same Internet page from which the product is available for download.


HOWEVER, see the comment above that the UCUM committee says it's ok to use UCUM as described in this issue. I.e. not to get too hung up on this legalese (which is indeed pretty bad, compared to modern open licences)

@dr-shorthair

dr-shorthair commented Nov 6, 2020

@VladimirAlexiev

where do you have mapping tables QUDT-UCUM showing in particular the gaps on either side?

  1. Get https://github.com/qudt/qudt-public-repo/blob/master/vocab/unit/VOCAB_QUDT-UNITS-ALL-v2.1.ttl
  2. Run
PREFIX qudt: <http://qudt.org/schema/qudt/>
SELECT *
WHERE {
	?p a qudt:Unit .
	MINUS { ?p a qudt:CurrencyUnit . }
	OPTIONAL { ?p qudt:ucumCode ?u . }
}
  3. I don't think there are any gaps on the QUDT side relative to the UCUM terminals, but since UCUM does not define a closed set (novel combinations are always possible) there will not be a member in the QUDT catalogue for every arbitrary UCUM code.

@dr-shorthair

dr-shorthair commented Nov 6, 2020

Regarding the license, remember that (notwithstanding the use of XML for the reference data) UCUM was developed in a pre-linked-data, and pre-CC world. And since UCUM is widely built in to medical and clinical software, there was a concern to ensure that there not be any muddling of libraries and conversions. That would be a real problem. As I am in contact with both the QUDT and UCUM maintainers, clarifying the license is a priority to clear up any issues about the appearance of UCUM codes alongside a separately derived set of conversion factors. But this should definitely NOT impede mentioning UCUM codes in RDF and SPARQL.

Note that the US National Library of Medicine is now providing the main support for UCUM, including an API and a Javascript library.

@ericprud
Member

@VladimirAlexiev

@JervenBolleman

we don't support FILTER("2+2"^^xsd:integer = 4) and that is natural

But we support FILTER("2"^^xsd:integer + 2 = 4).
What is unnatural is that we don't support "1 m"^^cdt:ucum + "100 cm"^^cdt:ucum or "1"^^ucum:m + "100"^^ucum:cm.

I think a more apt analogy for "1"^^ucum:m + "100"^^ucum:cm would be FILTER(2 + 2.0 = 4). The fact that "2"^^xsd:integer parses to the same internal representation as 2 is just a feature of the parser semantics. The ability to add a double and an integer and compare the result to an integer (in fact, the comparison substitutes the double 4.0) is orchestrated by XPath's numeric type promotion and type substitution. Extrapolating that to apply to units would give us that same functionality and some nice unit analysis as a side benefit. I can see a couple of ways to do that:

Canonical units

For every dimension we specify (length, charge, mass...), pick a canonical unit. (MKS would be practical and would add another attractor tugging the US forward to the 18th century.) Enumerate all of the compatible units with linear functions mapping them to the canonical unit:
ucum:m -> +0, *1 ucum:m
ucum:in -> +0, *.0254 ucum:m
ucum:f -> -32, /1.8 ucum:c

Any evaluation requiring the promotion of the left column to the right column applies the transformation and leaves you with the canonical units. Where the current operator table has entries like

| Operator | Type(A) | Type(B) | Function | Result type |
|---|---|---|---|---|
| A + B | numeric | numeric | op:numeric-add(A, B) | numeric |

we could add entries for the dimensions:

| Operator | Type(A) | Type(B) | Function | Result type |
|---|---|---|---|---|
| A + B | length | length | op:numeric-add(A, B) | length |

This is cool because the operator table prevents us from adding a length to a time. It's a little funny because everything gets metrified, e.g. BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you ".3302"^^ucum:m.
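
The canonical-unit promotion described above can be sketched in a few lines of Python. This is only an illustration of the idea: the `CANONICAL` table, its row layout and the function names are hypothetical, the datatype names are the ones used in this comment, and the Fahrenheit row uses the exact factor 5/9 (that is, divide by 1.8 after subtracting 32).

```python
from fractions import Fraction

# Hypothetical table: datatype -> (dimension, offset, scale, canonical unit),
# where canonical value = (value + offset) * scale.
CANONICAL = {
    "ucum:m": ("length", Fraction(0), Fraction(1), "ucum:m"),
    "ucum:in": ("length", Fraction(0), Fraction("0.0254"), "ucum:m"),
    "ucum:ft": ("length", Fraction(0), Fraction("0.3048"), "ucum:m"),
    "ucum:s": ("time", Fraction(0), Fraction(1), "ucum:s"),
    "ucum:f": ("temperature", Fraction(-32), Fraction(5, 9), "ucum:c"),
}


def promote(value, datatype):
    """Promote a (value, datatype) pair to its dimension's canonical unit."""
    dimension, offset, scale, canonical = CANONICAL[datatype]
    return dimension, (Fraction(value) + offset) * scale, canonical


def add(a, b):
    """A + B for two (value, datatype) pairs; only same-dimension operands are allowed."""
    dim_a, val_a, unit = promote(*a)
    dim_b, val_b, _ = promote(*b)
    if dim_a != dim_b:
        raise TypeError(f"cannot add {dim_a} to {dim_b}")
    return val_a + val_b, unit
```

Adding one foot and one inch this way yields 0.3302 in `ucum:m`, matching the "metrified" result above, while adding a length to a time is rejected by the dimension check that stands in for the operator table.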

Unit ladder

We could ameliorate that a bit by grouping entries in the type promotion hierarchy so that known imperial units stay imperial and get promoted to the smallest imperial unit, so BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you "13"^^ucum:in. Things that don't fit into one of those groups would still get metrified (yes, I made that word up), e.g. BIND("1"^^ucum:lightyear + "1"^^ucum:parsec AS ?x) will give you roughly "4.03E16"^^ucum:m.
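
A rough Python sketch of the ladder idea, with several assumptions: the `LADDERS` table and function names are hypothetical, each ladder stores a unit's size in terms of the group's smallest unit, and the result is promoted to the smaller of the two operand units (one possible reading of "smallest imperial unit").

```python
from fractions import Fraction

# Hypothetical ladders: unit -> size expressed in the group's smallest unit.
LADDERS = {
    "imperial-length": {"ucum:in": Fraction(1), "ucum:ft": Fraction(12)},
    "metric-length": {"ucum:cm": Fraction(1), "ucum:m": Fraction(100)},
}


def ladder_of(datatype):
    """Return the ladder containing this datatype, or None if there is none."""
    for units in LADDERS.values():
        if datatype in units:
            return units
    return None


def add(a, b):
    """Add two (value, datatype) pairs that share a ladder, staying in that ladder."""
    (value_a, unit_a), (value_b, unit_b) = a, b
    ladder = ladder_of(unit_a)
    if ladder is None or unit_b not in ladder:
        raise NotImplementedError("no shared ladder: fall back to the canonical unit")
    # Promote to the smaller of the two operand units.
    target = min(unit_a, unit_b, key=ladder.get)
    total = Fraction(value_a) * ladder[unit_a] + Fraction(value_b) * ladder[unit_b]
    return total / ladder[target], target
```

Here one foot plus one inch gives 13 in `ucum:in`, and operands from different ladders fall through to whatever canonical-unit promotion the engine uses.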

P.S.

It would be lovely to extend the grammar so we could write 1ft instead of "1"^^ucum:foot (which, as a parser feature, is orthogonal to the "1"^^ucum:foot vs. "1ft"^^ucum:length debate). I guess feasibility comes down to how crazy the lexical strings for the units are.

@sa-bpelakh

@VladimirAlexiev

Canonical units

I like the design for canonical units, and the implementation is well defined. I definitely prefer "1"^^ucum:foot instead of "1ft"^^ucum:length, because the unit implies the dimension, and avoids a micro-grammar in the literal value.

I think the complexity of the unit ladder could be avoided if you allow casting conversions, e.g. bind(ucum:foot(?a + ?b + ?c) as ?length_in_feet) to guarantee a specific unit (and do dimension checking in the process)

@kasei
Collaborator

kasei commented Nov 20, 2020

@ericprud

It's a little funny because everything gets metrified, e.g. BIND("1"^^ucum:ft + "1"^^ucum:in AS ?x) will give you ".3302"^^ucum:m.

I would think this could be handled just like the XPath constructor functions:

ucum:in("1"^^ucum:ft + "1"^^ucum:in) => "13"^^ucum:in

(Though there might be some funny floating point error issues to consider.)
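
A small Python sketch of such a cast, and of one way to sidestep the floating-point concern. The tables and function names are hypothetical; the point is only that routing through metres with binary floats may leave a tiny rounding error, while exact rational arithmetic (here `fractions.Fraction`) cannot.

```python
from fractions import Fraction

# Hypothetical conversion factors to metres, once as floats and once exact.
TO_METRE_FLOAT = {"ucum:ft": 0.3048, "ucum:in": 0.0254}
TO_METRE_EXACT = {"ucum:ft": Fraction("0.3048"), "ucum:in": Fraction("0.0254")}


def to_metres(value, datatype, table):
    """Promote a value in the given unit to metres."""
    return value * table[datatype]


def cast(metres, target, table):
    """Constructor-style cast: express a length in metres in the target unit."""
    return metres / table[target]


def foot_plus_inch_in_inches(table, one):
    """Evaluate ucum:in("1"^^ucum:ft + "1"^^ucum:in) via metres."""
    total = to_metres(one, "ucum:ft", table) + to_metres(one, "ucum:in", table)
    return cast(total, "ucum:in", table)
```

`foot_plus_inch_in_inches(TO_METRE_EXACT, Fraction(1))` is exactly 13; the float variant is only guaranteed to be close to 13, which is why an implementation might prefer decimal or rational arithmetic for conversions.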

@ericprud
Member

@kasei , that makes sense to me. I think your example converts 1ft and 1in both to meters and then back to inches. If you knew the types (e.g. they weren't plucked from some heterogeneous attribute in the data), you could avoid that by narrowing the scope of the cast:

ucum:in("1"^^ucum:ft) + "1"^^ucum:in => "13"^^ucum:in

I mention this because an alternative would be that casting functions override the type promotion of their arguments by applying the cast to each of the contained atoms. This sounds terribly contrived but useful enough to have a moment of collective consideration.

@dr-shorthair

All the libraries or catalogues that I've looked at record conversion factors to SI.
So any comparison of non-SI scaled quantities would necessarily trip through a conversion to SI.

@ericprud
Member

i think we can drop this notion of having a clairvoyant cast function that operates over the operands of any nested operators.

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Nov 28, 2020

@ericprud and @sa-bpelakh

  • For clarity: LINDT has datatypes cdt:ucum and per-kind datatypes like cdt:length, but not per-unit datatypes like ucum:m
  • LINDT does arithmetic operations and comparisons, and no other library does that
  • feasibility comes down to how crazy the lexical strings for the units are: some are "crazy" indeed!

Several people have proposed to use per-unit datatypes like "1"^^ucum:m instead of LINDT's approach, eg "1 m"^^cdt:ucum. But nobody has yet proposed how to handle the variety of "crazy" units.

  • UCUM defines a countably infinite list of units. Any RDF approach is necessarily finite.
  • UCUM is more dynamic. Eg "fuel flow" in USAF/NASA is sometimes measured in pounds per hour. With UCUM I can write "10 [lb_av]/h" right away, whereas with QUDT I have to propose a new unit: unit:LB-PER-HR qudt/qudt-public-repo#285. But I can live with that.
  • However, I cannot live with having to use URL escapes in datatype URLs
  • And I think the medical community needs their "unit annotations", eg {rbc}

@dr-shorthair

dr-shorthair commented Nov 29, 2020

I understand your concern about URL escapes. The following 'reserved' characters may appear in UCUM codes:

* ' ( ) + / [ ]

Of these, [ ] are commonly used and can't be easily worked around.
' appear in the codes for minutes and seconds, and in some qualified units like [in_i'H2O].
Parentheses ( ) can be used to group codes 'under the solidus' /, both of which can be avoided by using dots and negative exponents. + is only necessary for some power-of-ten factors.

I don't believe { } are reserved.
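
To see what percent-encoding does to such codes, here is a short Python sketch using the standard library's `urllib.parse.quote`. The `BASE` namespace and both function names are hypothetical. Note that `quote` with `safe=""` also escapes `{` and `}`: they are not in the RFC 3986 reserved set, but they are not in the unreserved set either.

```python
from urllib.parse import quote, unquote

BASE = "http://example.org/unit/"  # hypothetical datatype namespace


def datatype_iri(ucum_code: str) -> str:
    """Build a datatype IRI whose local name is the percent-encoded UCUM code."""
    return BASE + quote(ucum_code, safe="")


def ucum_code(iri: str) -> str:
    """Recover the UCUM code from an IRI built by datatype_iri."""
    return unquote(iri[len(BASE):])
```

With this scheme `mm[Hg]` becomes `mm%5BHg%5D` and `[in_i'H2O]` becomes `%5Bin_i%27H2O%5D`, which shows the readability cost discussed in this thread; the mapping is at least reversible.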

I think QUDT is a separate issue at this point. Yes, it may be useful as it provides an RDF-based model for describing units. But I would expect that it would be invoked through a call like
give me the QUDT description of the UOM with the UCUM symbol AAAaaaAAA
or similar. The UCUM symbol is the key.

@HolgerKnublauch

Vladimir, I still don't have a strong opinion against string-encoding. And I do agree that this flexible string encoding has some advantages, because it is more open-ended than having URIs and, as you point out, URL escapes can be ugly.

I do wonder though whether those complex compound units are important enough and whether they should dictate how the rest of the solution should work. Arguably the vast majority of use cases will be covered by a static set of predictable and well-established URIs for the commonly used units. Much will be gained if there is at least a solution for those. As long as there is a generic machinery to get from a Unit URI to the base units, conversion factors etc, even a URI mechanism would cover the more unusual cases. If units are URIs then these resources can hold additional metadata for this effect.

@maximelefrancois86

Dear all,

The BIPM (Bureau International des Poids et des Mesures - the intergovernmental organization through which Member States act together on matters related to measurement science and measurement standards) is organizing an on-line workshop Feb. 22-26 2021: The International System of Units (SI) in FAIR digital data

https://www.bipm.org/en/conference-centre/bipm-workshops/digital-si/

See a Draft - Grand Vision: Transforming the International System of Units for a Digital World

I was invited to present there. I aim to summarize the different approaches that have been discussed in the W3C groups I was involved in, and other approaches I am aware of in the SemWeb community, with their identified pros and cons.

You are welcome to attend this workshop too, the pre-registration form is here: https://form.jotform.com/BIPM/Workshop-SI-2021

@VladimirAlexiev
Contributor Author

@HolgerKnublauch and @maximelefrancois86 and @dr-shorthair I think we need both belt and suspenders:

  • LINDT for the speed, convenience and infinite on-demand extensibility
  • QUDT for the metadata: descriptions, dimensionality, scientific disciplines, and cross-links to other ontologies

BTW I'm now dealing with IEC and eClass units

@dr-shorthair

QUDT has links to some of these

Note that QUDT is now quite responsive to requests and bug reports, information supplementation etc.
Log an issue here - https://github.com/qudt/qudt-public-repo/issues
Better still: fork and make a PR.

@maximelefrancois86

As a matter of fact, theoretically in https://ucum.org/ucum.html

From 2.1§3■1 UCUM atom characters are in the ASCII range 33-126, minus a few characters. The following UCUM atom characters are forbidden in IRIs: <>|^`\ or need to be escaped in IRI local names: ~!$&'*,;?#@%_

From 2.1§6■1 UCUM characters for annotation { } are forbidden characters for IRIs

From 2.1§7■1 characters for operators . / need to be escaped in IRI local names

So encoding UCUM units in datatype IRIs, one would end up:

  • forbidding UCUM unit annotations
  • escaping many characters in IRI local names
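As a rough illustration of the escaping cost, the following Python sketch percent-encodes a few UCUM codes so that they could be appended to a datatype namespace. The namespace `http://example.org/ucum/` and the function `ucum_datatype_iri` are assumptions made for this example. It uses generic URL percent-encoding from the standard library, which is not the same as Turtle's backslash escapes for prefixed local names, so it shows the flavour of the problem rather than an exact serialization.

```python
from urllib.parse import quote

# Hypothetical namespace, used only for illustration.
BASE = "http://example.org/ucum/"


def ucum_datatype_iri(code: str) -> str:
    """Percent-encode every character of a UCUM code that is not unreserved."""
    return BASE + quote(code, safe="")


if __name__ == "__main__":
    for code in ["m", "[lb_av]/h", "[in_i'H2O]", "{rbc}"]:
        print(f"{code:12} -> {ucum_datatype_iri(code)}")
```

Simple codes such as m survive unchanged, while bracketed, annotated, or solidus-containing codes become hard to read.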

I understand your concern about URL escapes. The following 'reserved' characters may appear in UCUM codes:

* ' ( ) + / [ ]

Of these, [ ] are commonly used and can't be easily worked around.
' appear in the codes for minutes and seconds, and in some qualified units like [in_i'H2O].
Parentheses ( ) can be used to group codes 'under the solidus' /, both of which can be avoided by using dots and negative exponents. + is only necessary for some power-of-ten factors.

I don't believe { } are reserved.

@nichtich
Contributor

nichtich commented Aug 1, 2024

I can assure that real world RDF data with units of measure happens to be given in at least these forms (with varying namespaces and ontologies):

@prefix cdt: <https://w3id.org/cdt/> .
@prefix om: <http://www.ontology-of-units-of-measure.org/resource/om-2/> .

# 1. Custom plain string without any reference (most common, good luck)
_:x my:weight "10 KiloGram" . 

# 2. Reference to a standard notation such as UCUM (better)
_:x my:weight "10 kg"^^cdt:ucum .

# 3. Value and data type from some standard vocabulary, e.g. OM (UCUM in RDF)
_:x my:weight "10"^^om:kilogram .

# 4. Measurement node with some custom or standard vocabulary
_:x my:weight [
  my:value 10 ;
  my:unit om:kilogram
] .

The last form has several variants. Here is an actual example from practice using the CRM ontology (slightly simplified, it's even more complex!):

@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .

_:m a crm:E16_Measurement ;
  crm:P39_measured _:x ;
  crm:P40_observed_dimension [
    a crm:E54_Dimension ;
    crm:P90_has_value 2.8 ;
    crm:P91_has_unit [ # this would map to an existing unit URI such as om:centimetre
      a crm:E58_Measurement_Unit ;
      crm:P3_has_note "cm"
    ] ;
    crm:P2_has_type [ # this would need another vocabulary with a definition of "height"
      crm:P3_has_note "Höhe"
    ]
  ] .

To handle and clean up these ways of modeling data with units of measure I'd stick to:

  1. a standard to write down measures in string form and a corresponding RDF data type: UCUM and cdt:ucum look good!

  2. URIs for units of measurement such as kg, cm...: some have already been proposed and people will not stop creating new URIs and their own ontologies and lists for units with their own use cases. Any approach to collect all units in one single ontology is futile.

  3. An ontology to link units of measurement, e.g. to state that a unit my:RomanMile is 5000 times another unit my:RomanFeet. SPARQL does not need to know about actual units, just about how to process their conversion factors.
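Point 3 above can be sketched in Python as follows: the converter knows nothing about specific units and only follows conversion factors. The table `CONVERSIONS` and the functions `factor_to_root` and `convert` are hypothetical stand-ins for data that would come from such a linking ontology; the sketch assumes the links are acyclic and purely multiplicative (no offsets, so not suitable for degF to degC).

```python
from fractions import Fraction

# Hypothetical conversion links: unit -> (other unit, factor),
# read as "1 <unit> = factor <other unit>".
CONVERSIONS = {
    "my:RomanMile": ("my:RomanFeet", Fraction(5000)),
}


def factor_to_root(unit: str) -> tuple[str, Fraction]:
    """Follow links until a unit with no outgoing link; return it and the total factor."""
    factor = Fraction(1)
    while unit in CONVERSIONS:
        unit, step = CONVERSIONS[unit]
        factor *= step
    return unit, factor


def convert(value, source: str, target: str) -> Fraction:
    """Convert value from source to target when both reach the same root unit."""
    source_root, source_factor = factor_to_root(source)
    target_root, target_factor = factor_to_root(target)
    if source_root != target_root:
        raise ValueError(f"{source} and {target} are not linked")
    return Fraction(value) * source_factor / target_factor


if __name__ == "__main__":
    print(convert(2, "my:RomanMile", "my:RomanFeet"))
    print(convert(2500, "my:RomanFeet", "my:RomanMile"))
```

Exact fractions are used so that chained factors do not accumulate floating-point error; a real implementation would also need dimension checks and offset-based conversions.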

@dr-shorthair

dr-shorthair commented Aug 2, 2024

IMAO we should encourage use of pattern 3. as it provides the required information in the most usable form

  3. Value and data type from some standard vocabulary, e.g. OM (UCUM in RDF)
    _:x my:weight "10"^^om:kilogram .

Unlike patterns 1. and 2. this does not use a microformat in which a literal must be parsed and broken up into multiple items. Pattern 3. can be processed by un-modified and unsupplemented RDF libraries.

And unlike pattern 4. it does not bury a scalar inside a data structure.

Yes, pattern 3. hands off interpretation of the scale to another service, but all the proposed options appear to do that anyway.

@kasei
Collaborator

kasei commented Aug 3, 2024

IMAO we should encourage use of pattern 3. as it provides the required information in the most usable form

I think "most usable" is going to be use-case dependent here. The CRM modeling is the way it is for reasons important to cultural heritage use-cases. The very verbose modeling here stems mostly from using an upper ontology that can be used to address diverse use-cases (e.g. the units and/or type of value such as "weight of 10kg" are not fixed or prescribed by the ontology), and allows metadata to be added to almost any part of the data (e.g. provenance data that preserves the exact lexical form of the value that might differ from a normalized numeric value; or adding a citation to exactly where a dimension value came from). FWIW, RDF 1.2 (RDF-star) may provide some new options to address these modeling needs.

Additionally, the CRM modeling has the advantage that it actually uses numeric values that will sort naturally in SPARQL (and use optimized storage and retrieval in many systems) without any runtime casting or conversion. Encouraging best practices can be good, but to maintain these benefits you'd have to go beyond best practices and ensure LINDT datatypes were officially supported by SPARQL and underlying stores.

@TallTed
Member

TallTed commented Aug 5, 2024

Additionally, the CRM modeling has the advantage that it actually uses numeric values that will sort naturally in SPARQL (and use optimized storage and retrieval in many systems) without any runtime casting or conversion.

Of course, numeric values for mass of 1 kg, mass of 0.997 kg, and mass of 999 g, all of which are valid, will not sort as desired, unless all mass values are converted (or forced) to kg or g.

(quibble: 10kg is a measure of mass, not weight, and is the same for the same object whether it's measured on Earth or the Moon. 10lbs is a measure of weight, not mass, and differs for the same object depending on whether it's measured on Earth or the Moon.)
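The sorting point can be shown with a small Python sketch using the three masses from the comment. The table `TO_KG` and the helper `in_kg` are hypothetical names for this example; it assumes only two units (kg and g) with known factors.

```python
from decimal import Decimal

# Assumed conversion factors to kilograms.
TO_KG = {"kg": Decimal("1"), "g": Decimal("0.001")}

masses = [
    (Decimal("1"), "kg"),
    (Decimal("0.997"), "kg"),
    (Decimal("999"), "g"),
]


def in_kg(quantity: tuple[Decimal, str]) -> Decimal:
    """Normalize a (value, unit) pair to kilograms."""
    value, unit = quantity
    return value * TO_KG[unit]


# Sorting on the bare number ignores the unit and puts 999 g last.
by_number = sorted(masses, key=lambda quantity: quantity[0])

# Sorting on the normalized value puts 999 g between 0.997 kg and 1 kg.
by_mass = sorted(masses, key=in_kg)

if __name__ == "__main__":
    print("by number:", by_number)
    print("by mass:  ", by_mass)
```

Whether the normalization happens at load time, in a SPARQL extension function, or inside a datatype handler is exactly the design choice being debated in this thread.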

@ericprud
Member

ericprud commented Aug 5, 2024

  • UCUM defines a countably infinite list of units. Any RDF approach is necessarily finite.

I don't know that it does have to be finite. What happens if we take UCUM verbatim and simply accept that there can be an infinite expression of datatypes, just as there can be an infinite expression of the values that they describe?

As a thought experiment, a more self-describing "datatype namespace" could define something like (borrowing from @TallTed's quibble):

# for some reason lbf is tied to Avoirdupois. whatever
"10"^^kind_n_type:massXdistanceYtimeYtime_lbf-av

@kasei
Collaborator

kasei commented Aug 5, 2024

Of course, numeric values for mass of 1 kg, mass of 0.997 kg, and mass of 999 g, all of which are valid, will not sort as desired, unless all mass values are converted (or forced) to kg or g.

Right. In the CIDOC case, you'd likely be restricting the query to a specific unit in the graph pattern, or be casting values with arbitrary units to a known unit via a SPARQL extension function (or client-side, which has its own set of challenges). I think that's somewhat orthogonal to the storage-level advantages of having real numeric types, but again this might be use-case dependent. FWIW, I think the Wikidata modeling has some similarities here, in that you can restrict to known units in the graph pattern by using the psn predicates for normalized values, and then on to a real quantityAmount numeric value.

@nichtich
Contributor

nichtich commented Aug 6, 2024

@kasei thanks for mentioning Wikidata. Its model of units of measure is documented with SPARQL queries here. The list of supported quantities is configured in a table, but this table could be given in RDF with a (hopefully simpler) subset of the QUDT Units Vocabulary.
