Pascal's 2¢
These are just some unorganized thoughts I had while working on the fringes of the codebase doing minor refactoring. They are also heavily biased by my work on PhEx.
I think about this project as a language application [1] for CQL. Specifically, we are building a translator from CQL to OHDSI Circe [2].
We can reuse pieces of the existing CQL implementation, so we don't need to generate a parser or design an AST: the CQFramework libraries already cover both. We have tools to parse high-level CQL code, and there is already a specification for the AST (ELM). We also have access to the Circe library, which provides domain objects for representing OHDSI cohort criteria.
Essentially, all we need to do is recursively walk the AST (ELM) and translate each node into the corresponding Circe criteria where possible. This is a significant oversimplification, as there isn't a one-to-one mapping between the ELM and Circe. We will inevitably need to do some AST pattern matching, because multi-node subtrees of the ELM likely map to a single Circe construct. While we are walking the tree, we will probably also need to keep a global Environment to keep track of symbols and other global (and potentially also scoped) information.
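To make the idea concrete, here is a minimal sketch of a recursive walk with one subtree pattern and a symbol-tracking Environment. Everything in it is an assumption for illustration: Node, Retrieve, Exists, And, and Translator are hypothetical stand-ins rather than the real ELM classes from the CQFramework libraries, and the returned strings stand in for Circe criteria objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for ELM nodes (NOT the real CQFramework classes).
interface Node {}

final class Retrieve implements Node {
    final String dataType;      // e.g. "Condition"
    final String valueSetName;  // symbol resolved through the Environment
    Retrieve(String dataType, String valueSetName) {
        this.dataType = dataType;
        this.valueSetName = valueSetName;
    }
}

final class Exists implements Node {
    final Node operand;
    Exists(Node operand) { this.operand = operand; }
}

final class And implements Node {
    final Node left;
    final Node right;
    And(Node left, Node right) { this.left = left; this.right = right; }
}

// Symbol table kept alongside the walk: maps value set names to concept set ids.
final class Environment {
    private final Map<String, Integer> conceptSetIds = new HashMap<>();

    void define(String name, int id) { conceptSetIds.put(name, id); }

    int lookup(String name) {
        Integer id = conceptSetIds.get(name);
        if (id == null) {
            throw new IllegalArgumentException("Unknown value set: " + name);
        }
        return id;
    }
}

final class Translator {
    private final Environment env;
    Translator(Environment env) { this.env = env; }

    // Returns a textual placeholder where real code would build a Circe object.
    String translate(Node node) {
        // Pattern match: the two-node subtree Exists(Retrieve) maps to ONE criterion.
        if (node instanceof Exists && ((Exists) node).operand instanceof Retrieve) {
            Retrieve r = (Retrieve) ((Exists) node).operand;
            return "Criteria(" + r.dataType + ", conceptSet=" + env.lookup(r.valueSetName) + ")";
        }
        // Plain recursion: translate both children and combine them.
        if (node instanceof And) {
            And and = (And) node;
            return "ALL(" + translate(and.left) + ", " + translate(and.right) + ")";
        }
        throw new UnsupportedOperationException(
            "No mapping for " + node.getClass().getSimpleName());
    }
}
```

In this sketch a bare Retrieve that is not wrapped in Exists is rejected outright, which is one way to surface the "where possible" cases early instead of silently producing an incomplete cohort definition.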
A more complicated approach, and one that is more common in the Java/ANTLR world, is to build a visitor for tree walking. We could consider this, but since our pipeline of operations on the AST is quite short, it might be overkill. However, one possible benefit of this approach is that we can use ANTLR to generate some of this code for us.
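For comparison, a visitor-based walk could look roughly like the sketch below. The types ElmNode, RetrieveNode, AndNode, ElmVisitor, and DescribingVisitor are hypothetical names invented for this example; they are not the real ELM classes and not ANTLR-generated code.

```java
// Hypothetical visitor interface: one method per node type.
interface ElmVisitor<R> {
    R visitRetrieve(RetrieveNode node);
    R visitAnd(AndNode node);
}

// Hypothetical node interface: every node dispatches to its own visit method.
interface ElmNode {
    <R> R accept(ElmVisitor<R> visitor);
}

final class RetrieveNode implements ElmNode {
    final String dataType;
    RetrieveNode(String dataType) { this.dataType = dataType; }

    @Override
    public <R> R accept(ElmVisitor<R> visitor) { return visitor.visitRetrieve(this); }
}

final class AndNode implements ElmNode {
    final ElmNode left;
    final ElmNode right;
    AndNode(ElmNode left, ElmNode right) { this.left = left; this.right = right; }

    @Override
    public <R> R accept(ElmVisitor<R> visitor) { return visitor.visitAnd(this); }
}

// One pass over the tree; the strings stand in for Circe criteria objects.
final class DescribingVisitor implements ElmVisitor<String> {
    @Override
    public String visitRetrieve(RetrieveNode node) {
        return "Criteria(" + node.dataType + ")";
    }

    @Override
    public String visitAnd(AndNode node) {
        // The visitor decides how to recurse into the children.
        return "ALL(" + node.left.accept(this) + ", " + node.right.accept(this) + ")";
    }
}
```

The cost is visible even at this size: every node type needs an accept method and every visitor needs a method per node type. That pays off when there are many passes over the same tree, which is not obviously our situation.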
Initially, we implemented the translator as an application with a static main function. This absolutely made sense at the time, but introduced some complexities when trying to use the translator as a library in PhEx. In my opinion, we should build the translator as a library that can be used as a regular Maven dependency, with a well-designed modular API. I've done some refactoring in this direction, but I think it would be good to decide on who we want the library consumers to be, and what we want our public API to look like in the short and medium term.
That said, I still think it makes sense to have a static main function to enable running the translator as a CLI application. The main function should just make use of the library, and the library shouldn't make any assumptions about how or where it is running (e.g. it shouldn't assume a static context).
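A rough sketch of that split, under assumed names: CqlToCirceTranslator and TranslatorCli are hypothetical, not the project's current classes, and the translate body is a placeholder rather than a real translation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical library entry point: instance-based, no static state,
// no knowledge of files, stdout, or command-line arguments.
final class CqlToCirceTranslator {

    String translate(String cqlSource) {
        if (cqlSource == null || cqlSource.trim().isEmpty()) {
            throw new IllegalArgumentException("CQL source must not be empty");
        }
        // Placeholder: real code would parse to ELM and walk the tree here.
        int lines = cqlSource.trim().split("\n").length;
        return "Circe output placeholder for " + lines + " line(s) of CQL";
    }
}

// Thin CLI wrapper: all I/O and process concerns live here, not in the library.
final class TranslatorCli {
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: translator <file.cql>");
            System.exit(2);
        }
        String cql = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(new CqlToCirceTranslator().translate(cql));
    }
}
```

With this shape, a consumer like PhEx would construct the translator and call translate directly, and the CLI becomes just one more consumer of the same API.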
At the moment we are manually constructing JSON strings using Java String functions in several places. This is error-prone, difficult to maintain, and makes working with Circe objects more difficult than necessary. In my opinion, we should use existing libraries like Jackson to (optionally) do serialization at the boundaries of our library. In fact, the Circe domain objects already have Jackson annotations (e.g. Observation).
Unfortunately, it seems that the OHDSI WebAPI expects stringified JSON payloads in some instances, but this is still possible to implement using Jackson custom serialization (or possibly a special annotation, I haven't checked).
I don't yet have a detailed understanding of how cohort definitions are represented in Circe. Specifically, I have not finished reading the book chapter [3] and I haven't looked at the code [2] in detail yet. Therefore, there is a chance that everything I said above could be totally invalid. If so, my apologies.
As a result of my incomplete understanding, I don't have a good idea of how CQL libraries should map to OHDSI cohort entry and exit events. Hopefully once I have better knowledge of the Circe conceptual model (see disclaimer), I'll have some ideas on this.
EDIT: Another important question that just came to mind is how we map the data model specified in the library to the OMOP data model. Which data models do we want to support?
- Circe docs
- Poster about converting PheKB phenotypes to Circe [abstract] [poster]
- Paper about the effect of vocabulary mapping for conditions on phenotype cohorts
- Circe discussion on OHDSI forums