New Epic: Idempotence #768

jmcook1186 · 2024-06-03T15:01:04Z

jmcook1186
Jun 3, 2024
Maintainer

Here's a new epic we want to work on in the next few weeks. I'm laying out the early thinking here so you can all comment and help us refine it before it gets worked up into tickets in our dev sprints.

Background and problem statement

A command is idempotent if the result is identical regardless of how many times the command is executed. In the context of IF, we want manifests to be idempotent, so that re-executing a manifest always generates the same result. We can't always guarantee this today, because many manifests have importer plugins as the first element in their pipelines and we don't control the servers behind the APIs those importers call. We can't be sure that the same request will always yield the same response, and by extension we can't guarantee that the same manifest will always give the same output. We anticipate many (most?) manifests using importers in the future. We also anticipate sharing and re-executing manifests being one of the main use cases for IF, and today these things are somewhat incompatible due to the lack of idempotence.

However, we can fix this issue from IF's side by separating out the IF execution into distinct phases and enabling intermediates to be exported at the end of each phase. The first phase can be the data import, which generates a static file. This static file can then be shared, archived, or passed into the next phase of execution.
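As a rough illustration of why a saved intermediate restores idempotence, here is a minimal Python sketch. It is not IF code: the names `flaky_importer`, `observe` and `compute` are hypothetical, the random value stands in for an external API whose responses we don't control, the `energy` arithmetic is a placeholder, and JSON stands in for the yaml file so that only the standard library is needed.

```python
import json
import random


def flaky_importer():
    """Stand-in for an importer plugin calling an external API we don't control."""
    return [{"timestamp": "2024-02-26 00:00:00",
             "cpu/utilization": random.randint(0, 100)}]


def observe(snapshot_path):
    """Run the import once and persist the observations as a static file."""
    observations = flaky_importer()
    with open(snapshot_path, "w") as f:
        json.dump(observations, f)
    return observations


def compute(snapshot_path):
    """Deterministic phase: reads only the static file, never the external API."""
    with open(snapshot_path) as f:
        observations = json.load(f)
    # Placeholder arithmetic, not a real impact model.
    return [{**obs, "energy": obs["cpu/utilization"] * 0.5} for obs in observations]
```

Calling `observe` twice may give different data, but calling `compute` any number of times on the same saved file gives the same result, which is the property we are after.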

Separating out distinct execution phases also helps with features such as time-sync and group-by. In our current monolithic design, we have to invoke these features at the right point in the pipeline in order for them to execute correctly. While it is obvious what the right sequence is for simple manifests, this isn't necessarily the case for more complex manifests, and it can lead to confusion and over-complicated manifest files.

It is also inefficient to have to re-execute an entire manifest, including the requests to external APIs in the importer plugins, just to change one element later in the execution pipeline. It would be much more efficient if we could capture static files at various points in the execution flow that can then be used to re-execute specific parts of the pipeline.

Solution

Split the pipeline and the compute logic into 3 distinct phases:
- observe: the pipeline to generate, import, or gather observations. The output from this pipeline should be a 1-D array of observations.
- group: group the 1-D array of observations into a structure that makes sense for the compute step that follows (and for aggregation, export, etc.), similar to the group-by builtin.
- compute: given a set of observations, this is the pipeline that calculates impacts.

tree:
  observe:
    - mock-observations
  group:
    - cloud/instance-type
  compute:
    - cloud-metadata
    - watttime
    - teads-curve
    - operational-carbon
  inputs: null

We first traverse the tree and run all the plugins in the observe pipeline. Their results are either added to the inputs (in additive run mode) or replace the inputs (in replace mode). We have the option to capture the state of the manifest after the observe operations have been applied and save it to a yaml file.

We then traverse the tree and run the grouping logic on the inputs. Again, we have the option to capture the state of the manifest after the group operations have been applied and save it to a yaml file.

We then traverse the tree and run the compute pipeline.

For the observe and group phases, only the inputs change. compute doesn't change the inputs; it only generates outputs.
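The flow described above can be sketched roughly as follows. This is a toy Python model, not IF's implementation: the function names, the hard-coded mock observations, and the `energy` arithmetic are all assumptions made for illustration. It runs the three phases in order and keeps a snapshot of the tree after each one, which is the state that could be saved to a yaml file.

```python
import copy


def observe(tree):
    """Observe phase: replace `inputs` with a flat (1-D) list of observations."""
    # Hypothetical mock data standing in for a mock-observations plugin.
    tree["inputs"] = [
        {"timestamp": "2024-02-26 00:00:00",
         "cloud/instance-type": "m5n.large", "cpu/utilization": 89},
        {"timestamp": "2024-02-26 00:05:00",
         "cloud/instance-type": "m5n.large", "cpu/utilization": 59},
    ]
    return tree


def group(tree):
    """Group phase: restructure the inputs under `children`, keyed by instance type."""
    children = {}
    for obs in tree.pop("inputs"):
        key = obs["cloud/instance-type"]
        children.setdefault(key, {"inputs": []})["inputs"].append(obs)
    tree["children"] = children
    return tree


def compute(tree):
    """Compute phase: leave inputs alone and only add outputs."""
    for child in tree["children"].values():
        # Placeholder arithmetic, not a real impact model.
        child["outputs"] = [{**obs, "energy": obs["cpu/utilization"] * 2}
                            for obs in child["inputs"]]
    return tree


def run(tree, phases=(observe, group, compute)):
    """Run the phases in order, capturing a snapshot of the tree after each one."""
    snapshots = {}
    for phase in phases:
        tree = phase(tree)
        snapshots[phase.__name__] = copy.deepcopy(tree)
    return snapshots
```

In this sketch, `run({"inputs": None})` returns one snapshot per phase; the inputs in the `compute` snapshot are identical to those in the `group` snapshot, and only `outputs` has been added.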

These phases should all run in sequence when you run the ie command, which is just the normal behaviour we have today. However, you should also be able to run each of the phases independently using the --observe, --group and --compute flags.
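One way the flag handling could look is sketched below in Python. This is only a model of the proposal, not the actual ie CLI: the `select_phases` helper is hypothetical, and the only option names used are the ones that appear in this post (`--observe`, `--group`, `--compute`, `-m`/`--manifest`, `-o`/`--output`).

```python
import argparse

PHASES = ("observe", "group", "compute")


def select_phases(argv):
    """Return the phases to run, in pipeline order, for a list of CLI arguments."""
    parser = argparse.ArgumentParser(prog="ie")
    parser.add_argument("-m", "--manifest", required=True)
    parser.add_argument("-o", "--output")
    for phase in PHASES:
        parser.add_argument(f"--{phase}", action="store_true")
    args = parser.parse_args(argv)
    chosen = [phase for phase in PHASES if getattr(args, phase)]
    # No phase flag means "run everything", matching today's behaviour.
    return chosen or list(PHASES)
```

With this sketch, passing no phase flag selects all three phases, while `--observe` alone selects only the observe phase. Phases are always returned in pipeline order, whatever order the flags were given in.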

If you just wanted to gather observations and not run the rest of the pipelines, you could run IF with only the --observe flag. The resulting manifest would look something like this, followed by the command that produces it:

tree:
  observe:
    - mock-observations
  group:
    - cloud/instance-type
  compute:
    - cloud-metadata
    - watttime
    - teads-curve
    - operational-carbon
  inputs:
    - timestamp: '2024-02-26 00:00:00'
      duration: 300
      cloud/instance-type: m5n.large
      cloud/vendor: aws
      cpu/utilization: 89
    - timestamp: '2024-02-26 00:05:00'
      duration: 300
      cloud/instance-type: m5n.large
      cloud/vendor: aws
      cpu/utilization: 59

ie --observe -m manifest.yml --output static-manifest.yml

This file can now act as a static manifest file which you can use without needing to run the importers.

Then you might run the above file with just --group and end up with something like so:

tree:
  observe:
    - mock-observations
  group:
    - cloud/instance-type
  compute:
    - cloud-metadata
    - watttime
    - teads-curve
    - operational-carbon
  children:
    m5n.large:
      inputs:
        - timestamp: '2024-02-26 00:00:00'
          duration: 300
          cloud/instance-type: m5n.large
          cloud/vendor: aws
          cpu/utilization: 89
        - timestamp: '2024-02-26 00:05:00'
          duration: 300
          cloud/instance-type: m5n.large
          cloud/vendor: aws
          cpu/utilization: 59

ie --group -m static-manifest.yml -o regrouped-manifest.yml
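To make the transformation concrete, here is a small Python sketch of grouping a flat list of observations into nested children. It is an assumption-laden illustration rather than IF's group-by code: the `group_by` function is hypothetical, and nesting one level per key is my guess at how a list of grouping keys would behave.

```python
def group_by(observations, keys):
    """Nest a flat list of observations under `children`, one level per key."""
    if not keys:
        # No keys left: this is a leaf node that holds the observations.
        return {"inputs": list(observations)}
    first, rest = keys[0], keys[1:]
    buckets = {}
    for obs in observations:
        buckets.setdefault(obs[first], []).append(obs)
    return {"children": {value: group_by(members, rest)
                         for value, members in buckets.items()}}
```

Grouping the two example observations by `["cloud/instance-type"]` yields a single `m5n.large` child holding both inputs, which matches the shape of the manifest above.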

Again, this is a static manifest file, which you can then run without any flags; it will run just the compute pipeline, which generates outputs.

ie --compute -m regrouped-manifest.yml -o outputs.yml

Tasks

- Refactor IF into three distinct execution phases
- Add commands to the CLI to trigger each execution phase in isolation
- Add ability to save state of manifest at end of each execution phase
- Update example manifests to work with new "phased" execution flow
- Update documentation to explain new phased execution flow

How you can help

You can read through this post and give feedback in comments, especially if you are a plugin developer that currently relies on node-level config. Later, when the specific tasks are available as tickets on our issue board you can let us know if you want to work on one. There may be some that are reserved for core developers, but in general we are keen to open up IF development to the community.

@jawache @zanete @narekhovhannisyan @MariamKhalatova @manushak

jawache · 2024-06-05T14:38:41Z

jawache
Jun 5, 2024
Maintainer

@jmcook1186 one thing I've thought re: group.

It perhaps is a re-group rather than group, as in imagine you've run it once, it has a grouping already applied.

The inputs would be "grouped"
The outputs would be "grouped"

Then you change the grouping in that output static manifest file and re-run. What happens? The inputs are already grouped in the old way.

I think the right approach is, at the start of the group phase, to turn everything into a 1-D time series of observations (ordered by time?), then run group again on this 1-D time series, then pass that to the compute phase.

That's quite useful: if you were given a static manifest grouped one way, you could re-group it in different ways, re-run, and voila.
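A minimal Python sketch of that flatten-then-regroup idea follows. The `flatten` and `regroup` names are hypothetical, not IF functions, and the sort assumes timestamps are strings in a format that sorts chronologically (as 'YYYY-MM-DD HH:MM:SS' does).

```python
def flatten(node):
    """Collect every observation under `node` into one flat list, ordered by timestamp."""
    observations = list(node.get("inputs") or [])
    for child in (node.get("children") or {}).values():
        observations.extend(flatten(child))
    return sorted(observations, key=lambda obs: obs["timestamp"])


def regroup(node, key):
    """Discard the existing grouping, then group the flat series by `key`."""
    children = {}
    for obs in flatten(node):
        children.setdefault(obs[key], {"inputs": []})["inputs"].append(obs)
    return {"children": children}
```

Because `regroup` always starts from the flattened series, the result depends only on the observations and the new key, not on how the manifest happened to be grouped before.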

0 replies

andrew-woosnam · 2024-06-07T16:30:08Z

andrew-woosnam
Jun 7, 2024

"inefficient to have to re-execute an entire manifest"

Yes! Sold :) sounds much more "green by design"

Question about the 3 phases: is the idea that a manifest would be restricted to running through just a single instance of each phase? Or could we plausibly allow for multiple combinations/repetitions (e.g. group, compute, group, compute, group, compute)? (Asking because @josh-swerdlow and I will need to rethink how the branch plugin has to change for this new 3-phase system.)

3 replies

jmcook1186 Jun 10, 2024
Maintainer Author

You could run each phase multiple times (or skip phases, or do things between phases) because IF would expose observe, regroup and compute as separate commands as well as if-run which would be a shortcut to running observe && regroup && compute. One of the things I like about this is that the individual commands can be used to build up more complex programs on top of IF, but I've only really thought about this in the context of chaining commands together on the command line or writing simple bash scripts to control the logic, rather than having higher level IF tooling to handle it.

josh-swerdlow Jun 10, 2024

Ok cool. Within the context of our branch plugin that gives me a bit to think about. I'm going to chew on this.

It makes me wonder if branching functionality would now fit better as an external script than a plugin...

Are there any plans or thoughts about taking any of the if-official plugins to external scripts? Jc

jawache Jun 11, 2024
Maintainer

The main thing we are moving to an external script (it was internal before) is the CSV generation. So there will be a separate if-csv script which extracts the relevant things you want from the manifest file into a CSV file. One of the reasons is that we were starting to put things related to the CSV export inside the manifest file, and then the CLI became confusing (the #carbon at the end of the filename in the CLI was there to tell the CSV portion of the code which field you wanted to export, and it's also the reason for the outputs: [yaml, csv] portion of the yaml, which caught a lot of people off guard).

#784 is the issue.

The main IF script will also just be printing to stdout (actually stderr) and then you can pipe that into your script.

if-run --manifest impact.yml | if-csv --field carbon
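To give a feel for what such a downstream script might do, here is a rough Python sketch that pulls one field out of a computed manifest tree and renders it as CSV. It says nothing about how if-csv will actually work: the function names, the `path` column, and the assumed tree shape (`children` nodes with `outputs` lists) are all hypothetical, and the manifest is taken as an already-parsed dict to avoid depending on a yaml library.

```python
import csv
import io


def manifest_rows(node, field, path=""):
    """Yield (path, timestamp, value) rows for `field` from every node's outputs."""
    for output in node.get("outputs") or []:
        if field in output:
            yield (path, output["timestamp"], output[field])
    for name, child in (node.get("children") or {}).items():
        child_path = f"{path}/{name}" if path else name
        yield from manifest_rows(child, field, child_path)


def render_csv(tree, field):
    """Render the selected field as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["path", "timestamp", field])
    writer.writerows(manifest_rows(tree, field))
    return buffer.getvalue()
```

The point of the design is visible even in this toy: the choice of field lives in the consumer script's arguments, not in the manifest file.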

josh-swerdlow · 2024-06-08T19:51:24Z

josh-swerdlow
Jun 8, 2024

Does it make sense for an 'exhaust'/'terminal' phase to be added to this as well?

One idea Andrew and I threw around for the hackathon was creating a plugin that would only be put at the end and it would analyze what had occurred in the pipeline to generate a human readable audit. While discussing the idea, we broke down the phases of work that could occur and created similar phases to the above suggestions, but also had a 'terminal' phase. This is where outputs would be created, validation checks could occur, graphs computed, or human readable audits could be written. They are always guaranteed to go at the end.

3 replies

jmcook1186 Jun 10, 2024
Maintainer Author

Hi @josh-swerdlow, yes, except that we have this proposal for handling exhaust, which we think will work better than plugins. In this proposal, IF can only generate yaml data and either dump it to the console or save it to file. Any other presentation or transformation of the data beyond the IF yaml output has to be handled in separate scripts that take that yaml data as input. In that way of thinking, all IF is really doing is enriching the inputs array repeatedly until it reaches the end of its execution pipeline, so there isn't a well-defined exhaust phase to separate out, which is why we handle it as part of compute.

jawache Jun 10, 2024
Maintainer

Yup, exactly. We're seeing the manifest file as the communication protocol for environmental impacts. We've even started conversations with the standards working group to explore implementing it as a formal specification: the Impact Manifest Protocol (maybe when it's saved as a file we call them .imp files ;) ).

We experimented with a plugin system to exhaust data to 3rd party systems, but then the manifest file stops looking like evidence of emissions and starts looking like a configuration file for an ETL tool, as you have to add all the configuration for what you want to do with the raw data in the manifest file itself. How you want to display the data gets jumbled in with how you want to compute the data.

To keep things clean and separate we want tools to speak IMP to each other, the raw IF tool just computes manifest files, but then other tools can read (static, computed) manifest files and do whatever they want with them. Visualize, load into external systems, maybe run simulations etc...

josh-swerdlow Jun 10, 2024

That makes sense. Thanks for the thoughts :)
