LINDT units of measure #129
Comments
I have been thinking about this exact thing. However, I was thinking that all of UCUM might be too much work to demand, and it is currently awkward licensing-wise (section 7 of the license). So I was wondering whether the 7 base units of the International System of Units plus the 21 coherent derived named units might be a sufficient lower bound for implementation.
The implementation advantage is that these are simple numeric values: as long as the datatype is the same, cast to decimal and compare. There are no prefixes; everything must be converted to the base unit, e.g. 600 km should be stored as "600000"^^unit:m. The list misses Celsius, which is trivial to add but would need to be comparable to Kelvin. Basically, the cost is in converting to base units at storage time. However, consistent storage makes it easier to build indexes and ensure query results are correct. This of course leaves out all the derived units, e.g. kg/m, m/s^2 etc., which I think would be very worthwhile to have but might be too much work to implement for all commonly used ones, unless they follow a straightforward pattern in IRI encoding that stores can decode and generate on the fly. E.g. an option: division "1"^^unit:kg / "1"^^unit:m => "1"^^unit:kg-per-m and "2"^^unit:m * "2"^^unit:m => "4"^^unit:m2. Also, the use of datatypes avoids one of the issues with UCUM, namely that conflicts in coding exist and that coding for customary units is not straightforward. See Fluid Ounce, where a datatype pointing to the fluid ounce definition would be a clearer option.
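To make the storage-time conversion concrete, here is a minimal Python sketch of the idea. The conversion table, the function names and the "-per-" naming pattern are illustrative assumptions taken from the examples in this comment, not part of any existing store or vocabulary.

```python
from decimal import Decimal

# Hypothetical table: unit label -> (SI base unit, factor to that base unit).
TO_BASE = {
    "km": ("m", Decimal("1000")),
    "m": ("m", Decimal("1")),
    "cm": ("m", Decimal("0.01")),
    "g": ("kg", Decimal("0.001")),
    "kg": ("kg", Decimal("1")),
}


def to_base(value: str, unit: str) -> tuple[Decimal, str]:
    """Convert a value to its base unit once, at storage time."""
    base, factor = TO_BASE[unit]
    return Decimal(value) * factor, base


def divide(a: tuple[Decimal, str], b: tuple[Decimal, str]) -> tuple[Decimal, str]:
    """Divide two stored quantities and derive the result unit name on the fly.

    The "<x>-per-<y>" pattern is only the naming idea floated above.
    """
    return a[0] / b[0], f"{a[1]}-per-{b[1]}"


# 600 km is stored as 600000 in the base unit m.
stored = to_base("600", "km")
# Once stored, comparison is a plain decimal comparison within one datatype.
shorter = to_base("1", "cm")[0] < to_base("1", "m")[0]
```

Because every value of a given dimension ends up in one datatype, an index over the numeric value alone is enough to answer range queries.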
|
@JervenBolleman what licensing problems do you see? UCUM has customary units that are important in many disciplines. If you ask users to always store angstroms and light-years as meters, you are shifting burden to them. And even "decorative" units like LINDT is implemented in Jena, and @jeenbroekstra says won't be too hard to port to rdf4j. So "Downgrading" to your proposal will be more work for java devs. So that leaves other languages. The suggestion is whether there are UCUM libraries in other languages? |
I much prefer using explicit datatypes to encoding the unit into the string. E.g. "1"^^unit:m is better than "1 m"^^ucum:unit. But a more general solution might be to offer a declarative extension point so that anyone can define custom datatypes and those datatypes actually can be used consistently. This could work similar to user-defined SHACL constraint components or SHACL-AF functions. Just some very quick thoughts, a datatype might need to be able to respond to questions "can I compare my value to another datatype" (e.g. yes for mm to m comparison), and then a normalize function that would bring all datatypes from a group to a common base unit, e.g. meter. Then things like < comparison in SPARQL can be automated. The actual business logic can probably be covered declaratively through a couple of properties that are attached to the units as done (comprehensively) in the QUDT vocabulary. The advantage here is that a SPARQL 1.2 would only need to implement a few generic building blocks while the details of the specific datatypes are irrelevant, and we don't even need to discuss the specific catalog of datatypes that need to be implemented. |
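A rough Python sketch of the two questions such an extension point would have to answer ("can I compare these datatypes?" and "what is the normalized value?"). The registry, the `unit:` names and the field names are hypothetical; in practice the same facts could come from properties attached to the units, as in QUDT.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UnitDatatype:
    iri: str         # datatype IRI (hypothetical "unit:" names)
    group: str       # datatypes in the same group are comparable
    factor: Decimal  # multiplier to the group's common base unit


# Declarative description of the datatypes; this is all an engine needs.
REGISTRY = {
    d.iri: d
    for d in (
        UnitDatatype("unit:m", "length", Decimal("1")),
        UnitDatatype("unit:mm", "length", Decimal("0.001")),
        UnitDatatype("unit:kg", "mass", Decimal("1")),
    )
}


def comparable(dt_a: str, dt_b: str) -> bool:
    """True if values of the two datatypes may be compared."""
    return REGISTRY[dt_a].group == REGISTRY[dt_b].group


def normalize(lexical: str, datatype: str) -> Decimal:
    """Bring a literal to the common base unit of its group."""
    return Decimal(lexical) * REGISTRY[datatype].factor


def less_than(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """Generic '<' over (lexical, datatype) pairs, as an engine could apply it."""
    if not comparable(a[1], b[1]):
        raise TypeError(f"cannot compare {a[1]} with {b[1]}")
    return normalize(*a) < normalize(*b)
```

With this shape the engine implements only `comparable`, `normalize` and the generic operators; the catalog of datatypes stays data.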
Connecting UCUM unit symbols to ontologies like QUDT or OM is important and useful because they expose as triples info that is within the UCUM library (eg the dimension vector of Newton and the conversion factors of Fahrenheit). Plus extra info, eg that Inch is an imperial unit, grouping of units by discipline, etc. I believe QUDT or OM already has UCUM codes, so that should not be hard. So we could spec custom functions to parse out a unit from a quantity, and connect to a structured unit node in such an ontology. |
In addition to the generic cdt:ucum, LINDT also has datatypes ucum:length, ucum:mass etc that represent quantities with fixed/known dimension. But I see some problems with having distinct datatypes for each unit:
LINDT does that:
BTW @maximelefrancois86 there are Broken links at https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes:
In contrast, both of https://ci.mines-stetienne.fr/lindt/v1/custom_datatypes and https://ci.mines-stetienne.fr/lindt/v3/custom_datatypes exist. |
|
@namedgraph This unit is well structured and well defined according to https://ucum.org/ucum.html (and implemented in Java UCUM and consequently in LINDT). NOT everything needs or should be structured in RDF. Eg are you against GeoSPARQL literals (WKT and GML)? |
Dear all, @namedgraph , there are sometimes good rationale to encode complex values using literals instead of relying on RDF structures and basic datatypes. The OGC GeoSPARQL datatype
I back up the thoughts of @VladimirAlexiev :
|
OK fine, WKT literals is a counter-example. But GeoSPARQL is an additional standard, not part of the SPARQL spec. And you seem to have done the same with units. So what's the problem? Why does it need to be in SPARQL 1.2 proper? |
I am neutral on this. I also think it would be fine to have such a datatype specified in a separate document. |
I don't see this becoming part of SPARQL 1.2. As I said "Adopt LINDT as a best practice" and work with other communities to adopt it.
|
Thank you @VladimirAlexiev , we are on the same page. For complex numbers: we just had a very good first year Master student that worked on this during her 3-months internship this year: Yana Soares de Paula. https://www.linkedin.com/in/yanaspaula/ She did an excellent job in just three months, but more work would be needed to augment About the broken links in lindt v1, I'll create an issue and check asap. Thanks for the notice. |
Eg there seem to be 2 for JS:
It appears UCUM is the dominant UoM system in life sciences.
|
See also
Maybe there are more |
@VladimirAlexiev Most QUDT units now have UCUM codes in their description, so correlation of these two systems is already available. (I did this work in the last few months.) The ones that are missing do not have equivalent UCUM codes, so there is nothing on that side to correlate with. QUDT gives you explicit dimension-vectors, and conversion factors (and offsets, where appropriate). I'm also on the UCUM Advisory Board, and the licensing issue is high on the agenda. |
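Why an offset is needed in addition to a factor can be shown with temperatures. The sketch below is an illustration only: the table layout and the convention `kelvin = value * scale + offset` are assumptions made here and are not claimed to match QUDT's own property semantics. The keys are the UCUM codes for kelvin, degree Celsius and degree Fahrenheit.

```python
from fractions import Fraction

# Assumed convention for this sketch: kelvin = value * scale + offset.
TEMPERATURE = {
    "K": (Fraction(1), Fraction(0)),
    "Cel": (Fraction(1), Fraction("273.15")),
    "[degF]": (Fraction(5, 9), Fraction("459.67") * Fraction(5, 9)),
}


def convert(value, src: str, dst: str) -> Fraction:
    """Convert a temperature between units via kelvin, using exact rationals."""
    scale_src, offset_src = TEMPERATURE[src]
    scale_dst, offset_dst = TEMPERATURE[dst]
    kelvin = Fraction(value) * scale_src + offset_src
    return (kelvin - offset_dst) / scale_dst


freezing_in_celsius = convert(32, "[degF]", "Cel")
boiling_in_fahrenheit = convert(100, "Cel", "[degF]")
```

A purely multiplicative table cannot express these two conversions, which is why units such as Celsius need separate treatment in any "convert to base unit" scheme.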
On the matter of style: I vote with @HolgerKnublauch in favour of
compared with
It does not require a string to be parsed, so basic SPARQL queries can be used, detecting the datatype, but without regexing strings. |
@JervenBolleman I think your table matches the UCUM codes, except for This XML representation is the reference for the UCUM terminals. |
It does seem like the best way to adhere to standard and industry-wide use but avoid mixing domain-specific aspects into the generic SPARQL standard, this should be an auxiliary standard, like GeoSPARQL. The complex |
Scaling factors for scalar quantities are not domain specific at all. In fact I'd argue it is a notable failure of almost all computer languages that this is not built-in. I'm totally fine with embedding coordinate sequences in a microformat, since they have no meaning considered independently. I was in the team that standardized GeoSPARQL and am very comfortable with the design choice. But scalar quantities are a very different matter, and much more simple. |
I also think that units are a different topic than GeoSPARQL. Users should expect to perform comparisons using the built-in < and > operators, and possibly to do arithmetics such as + and * on unit'ed (is this a word?) values. This might of course just become a matter of enough implementations agreeing on a de-facto standard, but it shouldn't be too hard to agree on a mechanism at least for the most common units in a SPARQL 1.2. Once it's in SPARQL then related standards such as SHACL would automatically "inherit" these features, e.g. for sh:minInclusive. |
@dr-shorthair I expanded my comment, changed to unit:Ohm, whose casing is inconsistent over standards. @HolgerKnublauch I also think easier support by stores for custom datatypes would be very nice, and would make implementing this feature cheaper for everyone. Let's open a separate issue for easier custom datatypes (and also easier sharing of custom function definitions). @VladimirAlexiev I would love full UCUM support for some projects I am involved in. I am just worried that it would be too large a code base for independent smaller SPARQL communities to implement. Also, I think we end up with a downstream licensing issue with UCUM until their license is changed, which might take a long time. |
@JervenBolleman I could see standardization of service description vocabulary terms for describing which custom datatypes are supported. Beyond that, though, wouldn't "easier custom datatypes" be an issue for individual implementations (and not something the spec can/should concern itself with)? What would spec involvement in this area look like? |
@dr-shorthair where do you have mapping tables QUDT-UCUM showing in particular the gaps on either side? As I wrote above, it's useful to have in RDF (QUDT) what UCUM libraries provide in code. Please comment on how you would represent the variety of UCUM strings (including annotations in curlies) as datatype URLs. You picked the easiest case @sa-bpelakh Agreed! As I said above, this can only be a recommended best practice, can't be part of the SPARQL spec.
For comparison, + and - you need Commensurate quantities (having same dimensionality). LINDT does all that.
Yes! And sh:lessThan
UCUM has implementations in many languages, they should leverage such implementations. |
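The commensurability rule can be sketched with dimension vectors. This is a toy model, not LINDT's or any UCUM library's implementation; the three-component vector (length, mass, time) and the class name are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Quantity:
    value: Fraction            # magnitude in coherent base units
    dim: tuple[int, int, int]  # exponents of (length, mass, time), an assumed order

    def __add__(self, other: "Quantity") -> "Quantity":
        # Addition (like comparison) requires the same dimensionality.
        if self.dim != other.dim:
            raise TypeError("quantities are not commensurate")
        return Quantity(self.value + other.value, self.dim)

    def __lt__(self, other: "Quantity") -> bool:
        if self.dim != other.dim:
            raise TypeError("quantities are not commensurate")
        return self.value < other.value

    def __mul__(self, other: "Quantity") -> "Quantity":
        # Multiplication is always allowed; exponents add up.
        dim = tuple(a + b for a, b in zip(self.dim, other.dim))
        return Quantity(self.value * other.value, dim)


def same_dimension(a: Quantity, b: Quantity) -> bool:
    """True if the two quantities can be added or compared."""
    return a.dim == b.dim


metre = Quantity(Fraction(1), (1, 0, 0))
second = Quantity(Fraction(1), (0, 0, 1))
area = metre * metre
```

Here `metre + second` raises `TypeError`, while `metre * second` is a valid quantity with a new dimension vector.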
As an implementor of several SPARQL systems in less popular languages, I join @JervenBolleman in concern at the implementation burden. Just because something like this has implementations in several languages does not mean there wouldn't be a real cost added to many existing (and possibly future!) systems. |
I'm a maintainer of the Python RDFLib (including its SPARQL executor) and developer of PySHACL. I agree with @VladimirAlexiev on this one. After reading the UCUM Spec I don't see how individual You'd need a string representation like While the set of units defined in UCUM is closed, the microformat is created in such a way that adding new units (in a subsequent version) is easy and predictable. If every unit in the current spec was pulled out into a discrete datatype in the ucum ontology, that would need to be updated whenever a new unit is added to UCUM. |
Here is the offending license clause: Subject to Section 1 and the other restrictions hereof, users may incorporate portions of the UCUM table and definitions into another master term dictionary (e.g. laboratory test definition database), or software program for distribution outside of the user's corporation or organization, provided that any such master term dictionary or software program includes the following fields reproduced in their entirety from the UCUM table: UCUM code, definition value and unit. Every copy of the UCUM table incorporated into or distributed in conjunction with another database or software program must include the following notice: “This product includes all or a portion of the UCUM table, UCUM codes, and UCUM definitions or is derived from it, subject to a license from Regenstrief Institute, Inc. and The UCUM Organization. Your use of the UCUM table, UCUM codes, UCUM definitions also is subject to this license, a copy of which is available at http://unisofmeasure.org. The current complete UCUM table, UCUM Specification are available for download at http://unitsofmeasure.org. The UCUM table and UCUM codes are copyright © 1995-2013, Regenstrief Institute, Inc. and the Unified Codes for Units of Measures (UCUM) Organization. All rights reserved. THE UCUM TABLE (IN ALL FORMATS), UCUM DEFINITIONS, AND SPECIFICATION ARE PROVIDED "AS IS." ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.” If the master term dictionary or software program containing the UCUM table, UCUM definitions and/or UCUM specification is distributed with a printed license, this statement must appear in the printed license. 
Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed on a fixed storage medium, a text file containing this information also must be stored on the storage medium in a file called "UCUM_short_license.txt". Where the master term dictionary or software program containing the UCUM table, UCUM definitions, and/or UCUM specification is distributed via the Internet, this information must be accessible on the same Internet page from which the product is available for download. HOWEVER, see the comment above that the UCUM committee says it's ok to use UCUM as described in this issue. I.e. not to get too hung up on this legalese (which is indeed pretty bad, compared to modern open licences) |
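PREFIX qudt: <http://qudt.org/schema/qudt/>

# List every non-currency QUDT unit together with its UCUM code, if any.
# Rows where ?u is unbound are units with no UCUM code recorded.
SELECT *
WHERE {
  ?p a qudt:Unit .
  MINUS { ?p a qudt:CurrencyUnit . }
  OPTIONAL { ?p qudt:ucumCode ?u . }
}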
|
Regarding the license, remember that (notwithstanding the use of XML for the reference data) UCUM was developed in a pre-linked-data, and pre-CC world. And since UCUM is widely built in to medical and clinical software, there was a concern to ensure that there not be any muddling of libraries and conversions. That would be a real problem. As I am in contact with both the QUDT and UCUM maintainers, clarifying the license is a priority to clear up any issues about the appearance of UCUM codes alongside a separately derived set of conversion factors. But this should definitely NOT impede mentioning UCUM codes in RDF and SPARQL. Note that the US National Library of Medicine is now providing the main support for UCUM, including an API and a Javascript library. |
I think a more apt analogy for
Canonical units
For every dimension we specify (length, charge, mass...), pick a canonical unit (MKS would be practical and would add another attractor tugging the US forward to the 18th century). Enumerate all of the compatible units with linear functions mapping them to the canonical:
Any evaluation requiring the promotion of the left column to the right column applies the transformation and leaves you with the canonical units. Where the current operator table has entries like
we could add entries for the dimensions:
This is cool because the operator table prevents us from adding a length to a time. It's a little funny because everything gets metrified, e.g.
Unit ladder
We could ameliorate that a bit by grouping entries in the type promotion hierarchy so that known imperial units stay imperial and get promoted to the smallest imperial unit, so
P.S.
It would be lovely to extend the grammar so we could write |
I like the design for canonical units, and the implementation is well defined. I definitely prefer I think the complexity of the unit ladder could be avoided if you allow casting conversions, e.g. |
I would think this could be handled just like the XPath constructor functions:
(Though there might be some funny floating point error issues to consider.) |
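A small Python sketch of the "promote to canonical, then cast back" idea, which also shows one way around the floating point concern: keep the conversion factors as exact rationals. The table, the function names and the choice of the metre as canonical unit are assumptions for illustration; this is not a proposed SPARQL function signature.

```python
from fractions import Fraction

# Exact factors to the assumed canonical length unit (metre).
METRES_PER = {
    "m": Fraction(1),
    "ft": Fraction("0.3048"),
    "in": Fraction("0.0254"),
}


def add_lengths(a: tuple[str, str], b: tuple[str, str]) -> Fraction:
    """Promote both (value, unit) operands to metres and add them."""
    return sum(Fraction(value) * METRES_PER[unit] for value, unit in (a, b))


def cast(metres: Fraction, unit: str) -> Fraction:
    """Constructor-style cast from the canonical unit to a requested unit."""
    return metres / METRES_PER[unit]


# 1 ft + 1 in, evaluated in metres and cast back to inches.
total_in_inches = cast(add_lengths(("1", "ft"), ("1", "in")), "in")
```

With rationals the round trip through metres is exact; doing the same arithmetic with binary floats may leave a small residue, which is the issue hinted at above.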
@kasei , that makes sense to me. I think your example converts 1ft and 1in both to meters and then back to inches. If you knew the types (e.g. they weren't plucked from some heterogeneous attribute in the data), you could avoid that by narrowing the scope of the cast:
I mention this because an alternative would be that casting functions override the type promotion of their arguments but applying the cast to each of the contained atoms. This sounds terribly contrived but useful enough to have a moment of collective consideration. |
All the libraries or catalogues that I've looked at record conversion factors to SI. |
I think we can drop this notion of having a clairvoyant cast function that operates over the operands of any nested operators. |
@ericprud and @sa-bpelakh
Several people have proposed to use per-kind units like
|
I understand your concern about URL escapes. The following 'reserved' characters may appear in UCUM codes:
Of these, I don't believe I think QUDT is a separate issue at this point. Yes, it may be useful as it provides an RDF-based model for describing units. But I would expect that it would be invoked through a call like |
Vladimir, I still don't have strong opinion against string-encoding. And I do agree that this flexible string encoding has some advantages, because it is more open-ended than having URIs and, as you point out, URL escapes can be ugly. I do wonder though whether those complex compound units are important enough and whether they should dictate how the rest of the solution should work. Arguably the vast majority of use cases will be covered by a static set of predictable and well-established URIs for the commonly used units. Much will be gained if there is at least a solution for those. As long as there is a generic machinery to get from a Unit URI to the base units, conversion factors etc, even a URI mechanism would cover the more unusual cases. If units are URIs then these resources can hold additional metadata for this effect. |
Dear all, The BIPM (Bureau International des Poids et des Mesures - the intergovernmental organization through which Member States act together on matters related to measurement science and measurement standards) is organizing an on-line workshop Feb. https://www.bipm.org/en/conference-centre/bipm-workshops/digital-si/ See a Draft - Grand Vision: Transforming the International System of Units for a Digital World I was invited to present there. I aim to summarize the different approaches that have been discussed in the W3C groups I was involved, and other approaches I am aware of in the SemWeb community, with the identified pros/cons You are welcome to attend this workshop too, the pre-registration form is here: https://form.jotform.com/BIPM/Workshop-SI-2021 |
@HolgerKnublauch and @maximelefrancois86 and @dr-shorthair I think we need both belt and suspenders:
BTW I'm now dealing with IEC and eClass units
|
Note that QUDT is now quite responsive to requests and bug reports, information supplementation etc. |
As a matter of fact, theoretically, in https://ucum.org/ucum.html:
From 2.1§3■1: UCUM atom characters are in the ASCII range 33-126, minus a few characters. The following UCUM atom characters are forbidden in IRIs: <>|^`\ or need to be escaped in IRI local names: ~!$&'*,;?#@%_
From 2.1§6■1: the UCUM annotation characters { } are forbidden in IRIs.
From 2.1§7■1: the operator characters . / need to be escaped in IRI local names.
So, encoding UCUM units in datatype IRIs, one would end up with:
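To get a feel for what such escaping looks like, the following Python sketch percent-encodes UCUM codes into full IRIs and decodes them again. The namespace is a made-up example, and percent-encoding is only one possible scheme (it is not the backslash escaping of Turtle local names discussed above), so treat the output as an illustration of the readability cost rather than a proposal.

```python
from urllib.parse import quote, unquote

# Hypothetical namespace for unit datatypes.
BASE = "http://example.org/ucum/"


def ucum_to_iri(code: str) -> str:
    """Percent-encode every reserved character of a UCUM code into an IRI."""
    return BASE + quote(code, safe="")


def iri_to_ucum(iri: str) -> str:
    """Recover the UCUM code from an IRI produced by ucum_to_iri."""
    return unquote(iri[len(BASE):])


for code in ("kg/m2", "{rbc}", "[in_i]", "m.s-1"):
    print(code, "->", ucum_to_iri(code))
```

The mapping is reversible, but codes with operators or annotations (for example `kg/m2` or `{rbc}`) become noticeably harder to read than the original UCUM string.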
|
I can assure you that real-world RDF data with units of measure comes in at least these forms (with varying namespaces and ontologies):
@prefix cdt: <https://w3id.org/cdt/> .
@prefix om: <http://www.ontology-of-units-of-measure.org/resource/om-2/> .
# 1. Custom plain string without any reference (most common, good luck)
_:x my:weight "10 KiloGram" .
# 2. Reference to a standard notation such as UCUM (better)
_:x my:weight "10 kg"^^cdt:ucum .
# 3. Value and data type from some standard vocabulary, e.g. OM (UCUM in RDF)
_:x my:weight "10"^^om:kilogram .
# 4. Measurement node with some custom or standard vocabulary
_:x my:weight [
  my:value 10 ;
  my:unit om:kilogram
] .
The last form has several variants. Here is an actual example from practice using the CRM ontology (slightly simplified, it's even more complex!):
To handle and clean up these ways of modelling data with units of measure I'd stick to:
|
IMAO we should encourage use of pattern 3. as it provides the required information in the most usable form
Unlike patterns 1. and 2. this does not use a microformat in which a literal must be parsed and broken up into multiple items. Pattern 3. can be processed by un-modified and unsupplemented RDF libraries. And unlike pattern 4. it does not bury a scalar inside a data structure. Yes, pattern 3. hands off interpretation of the scale to another service, but all the proposed options appear to do that anyway. |
I think "most usable" is going to be use-case dependent here. The CRM modeling is the way it is for reasons important to cultural heritage use-cases. The very verbose modeling here stems mostly from using an upper ontology that can be used to address diverse use-cases (e.g. the units and/or type of value such as "weight of 10kg" are not fixed or prescribed by the ontology), and allows metadata to be added to almost any part of the data (e.g. provenance data that preserves the exact lexical form of the value that might differ from a normalized numeric value; or adding a citation to exactly where a dimension value came from). FWIW, RDF 1.2 (RDF-star) may provide some new options to address these modeling needs. Additionally, the CRM modeling has the advantage that it actually uses numeric values that will sort naturally in SPARQL (and use optimized storage and retrieval in many systems) without any runtime casting or conversion. Encouraging best practices can be good, but to maintain these benefits you'd have to go beyond best practices and ensure LINDT datatypes were officially supported by SPARQL and underlying stores. |
Of course, numeric values for (quibble: 10kg is a measure of mass, not weight, and is the same for the same object whether it's measured on Earth or the Moon. 10lbs is a measure of weight, not mass, and differs for the same object depending on whether it's measured on Earth or the Moon.) |
I don't know that it does have to be finite. What happens if we take UCUM verbatim and simply accept that there can be an infinite expression of datatypes just as there can be an infinite expression of values that they describe. As a thought experiment, a more self-describing "datatype namespace" could define something like (borrowing from @TallTed's quibble): # for some reason lbf is tied to Avoirdupois. whatever
"10"^^kind_n_type:massXdistanceYtimeYtime_lbf-av |
Right. In the CIDOC case, you'd likely be restricting the query to a specific unit in the graph pattern, or be casting values with arbitrary units to a known unit via SPARQL extension function (or client-side, which has its own set of challenges). I think that's somewhat orthogonal to the storage-level advantages of having real numeric types, but again this might be use-case dependent. FWIW, I think the Wikidata modeling has some similarities here, in that you can restrict to known units in the graph pattern by using the |
@kasei thanks for mentioning Wikidata. Its model of units of measures is documented with SPARQL queries here. The list of supported quantities is configured in a table but this table could be given in RDF with a (hopefully more simple) subset of QUDT Units Vocabulary. |
Why?
It's hard to work with quantities (value + UoM) in RDF and SPARQL.
There are about 10 UoM ontologies:
L^1, area is L^2) and conversion factors (eg from cm to m, from degF to degC)
Working with units in SPARQL is quite hard. Comparing compatible units or doing arithmetic on units is possible if you are working with one of the better ontologies, but difficult. You have to fetch the dimension vectors and conversion factors and work with them, and the queries become very complex.
SHACL's modest arithmetic capabilities (eg minInclusive to compare to constant, lessThan to compare two props) borrow from SPARQL, so it's impossible to state "temperature should be between 0 and 10 degC", see https://lists.w3.org/Archives/Public/public-shacl/2020Nov/0001.html
But there is one approach that solves these problems.
Previous work
LINDT is unique in that it encodes both value and unit in one literal, eg "1 m"^^cdt:ucum, "100 cm"^^cdt:ucum. This is economical, but more importantly you can compare such quantities, and you can also do arithmetic operations on quantities.
This would be very useful for any sort of application in engineering, smart cities, semantic sensor networks, WoT, etc.
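As a rough illustration of what comparing such literals involves, here is a Python sketch that parses the lexical form into value and unit and compares after conversion. It is not LINDT's implementation: the regular expression, the tiny length table and the function names are assumptions, and a real handler would delegate to a UCUM library instead.

```python
import re
from decimal import Decimal

# Tiny stand-in for a UCUM library: UCUM length codes -> factor to metres.
METRES_PER = {"m": Decimal("1"), "cm": Decimal("0.01"), "km": Decimal("1000")}

# Lexical form assumed here: a decimal number, whitespace, then the unit code.
LEXICAL = re.compile(r"\s*([+-]?\d+(?:\.\d+)?)\s+(\S+)\s*")


def parse(lexical: str) -> tuple[Decimal, str]:
    """Split a literal such as "100 cm" into its value and UCUM code."""
    match = LEXICAL.fullmatch(lexical)
    if match is None:
        raise ValueError(f"not a quantity literal: {lexical!r}")
    return Decimal(match.group(1)), match.group(2)


def in_metres(lexical: str) -> Decimal:
    """Value of a length literal expressed in metres."""
    value, unit = parse(lexical)
    return value * METRES_PER[unit]


def equal(a: str, b: str) -> bool:
    """Value-based equality of two length literals."""
    return in_metres(a) == in_metres(b)


same = equal("1 m", "100 cm")
```

The point of a custom datatype handler is that this parsing and conversion happens inside the engine, so that operators such as `=` and `<` can be applied to the literals directly in a query.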
Features https://ci.mines-stetienne.fr/lindt/v2/custom_datatypes.html#on-apache-jena
xsd:int, xsd:decimal, xsd:float, xsd:double)
lindt:sameDimension(arg1,arg2) to check if two measurement literals are commensurable (returns an xsd:boolean).
LINDT is very ingenious and it's a pity that it hasn't found a wider following.
Proposed solution
Adopt LINDT as a best practice for representing units.
Work with other communities (WoT, semantic sensors) to also adopt it.
Considerations for backward compatibility
No direct consequences because it uses custom datatype handlers to do its work. I.e. if you don't use the CDT datatypes (cdt:ucum, cdt:length, etc) you'll see no difference.
However, guidance and solution templates for migrating from other systems for representing units should be provided.