
Time dependent data and data arithmetic?


I didn't know where we should put discussions like this, so I put it in the wiki for now. Let me (Tasqu) know if you think we should relocate my rant 😄

Related to old issues #44 and #52.

The problem:

At the moment, SpineInterface.jl supports two different types of time-dependent data:

  1. TimeSeries, which are essentially exactly what one would expect: an array of DateTimes and an array of the corresponding values.

  2. TimePattern, which is essentially a Dict mapping certain time periods (PeriodCollection) to the desired values.

SpineInterface.jl currently lacks any arithmetic for the TimeSeries and TimePattern types, so e.g. summing together two TimeSeries isn't supported. Funnily enough, we've never had to do stuff like this in SpineOpt.jl, as we only ever operate on the values inside the TimeSeries and TimePattern types, never on the types themselves. However, when doing any sort of data processing using SpineInterface.jl, this lack of arithmetic is something one runs into immediately when dealing with potentially time-dependent data. I spent a little too much time thinking about this last week, so I thought I'd share my thoughts.

The immediate fix to the situation:

TimeSeries-TimeSeries arithmetic

Implementing rudimentary arithmetic for TimeSeries should be pretty straightforward, except that we currently have to make some assumptions to cover special cases. In general, performing any simple arithmetic (+, -, *, /) with two TimeSeries simply comes down to the following:

  1. Merge and sort the timestamps from both TimeSeries to form the new index set.

  2. Sample both TimeSeries with each timestamp on the merged index set, perform the desired arithmetic operation, and save the resulting value for the corresponding timestamp.

This is all nice and clear as long as the two TimeSeries have exactly the same set of timestamps. Otherwise, we start running into issues regarding how to interpolate and extrapolate TimeSeries:

  • Currently, SpineInterface.jl has a built-in assumption about how to interpolate TimeSeries: simply use the latest value.

  • For the immediate fix, we could implement something similarly simplistic for the extrapolation: e.g. simply use zero.

However, I feel like these aren't really ideal as built-in assumptions, and should be up to the user to decide. See the Re-imagining time-dependent data section later for a proposal for how interpolation and extrapolation of time-dependent data should, in my opinion, be handled.
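
To make the merge-and-sample idea above a bit more concrete, here's a minimal Julia sketch. `SimpleTimeSeries`, `sample`, and `combine` are made-up names rather than the actual SpineInterface.jl types or API, and the sketch hard-codes exactly the assumptions discussed above: flat "latest value" interpolation and zero extrapolation.

```julia
using Dates

# Hypothetical stand-in for a TimeSeries: parallel vectors of timestamps and
# values, assumed sorted by timestamp. Not the actual SpineInterface.jl type.
struct SimpleTimeSeries
    indexes::Vector{DateTime}
    values::Vector{Float64}
end

# Sample a series at an arbitrary timestamp using the assumptions above:
# flat ("latest value") interpolation inside the data, zero outside it.
function sample(ts::SimpleTimeSeries, t::DateTime)
    (t < first(ts.indexes) || t > last(ts.indexes)) && return 0.0
    return ts.values[searchsortedlast(ts.indexes, t)]
end

# Merge-and-sample arithmetic: union of both index sets, then apply `op` pointwise.
function combine(op, a::SimpleTimeSeries, b::SimpleTimeSeries)
    stamps = sort(unique(vcat(a.indexes, b.indexes)))
    SimpleTimeSeries(stamps, [op(sample(a, t), sample(b, t)) for t in stamps])
end

a = SimpleTimeSeries([DateTime(2021, 1, 1, 0), DateTime(2021, 1, 1, 2)], [1.0, 2.0])
b = SimpleTimeSeries([DateTime(2021, 1, 1, 1), DateTime(2021, 1, 1, 2)], [10.0, 20.0])
combine(+, a, b)  # timestamps 00:00, 01:00, 02:00 with values 1.0, 11.0, 22.0
```

Replacing the hard-coded interpolation/extrapolation assumptions with user-provided ones is essentially what the re-imagining section below argues for.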

TimeSeries-TimePattern arithmetic

In principle, the basic idea behind the arithmetic stays the same: sample both types to construct the combined product. However, TimePatterns are much more vague with their timestamps than TimeSeries, and combining the timestamps is out of the question. Still, we can at least do the following:

  1. Use the timestamps from the TimeSeries to sample both the TimeSeries and the TimePattern.

  2. Perform the desired operation for the obtained values, and save the result in a new TimeSeries.

Thus, any arithmetic between a TimeSeries and a TimePattern would yield a TimeSeries as a result. I don't know if there would be any sense in forcing the opposite, where the TimeSeries is sampled with the TimePattern keys to form a new TimePattern, as we have no guarantee that the TimeSeries would repeat in sync with the TimePattern.
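
Roughly, in code (continuing the hypothetical `SimpleTimeSeries` and arithmetic from the sketch above), it could look like the following. `SimpleTimePattern` is a gross simplification that only maps hour-of-day ranges to values, whereas real TimePattern/PeriodCollection keys are far more general.

```julia
# Gross simplification of a TimePattern: only hour-of-day ranges mapped to
# values (the real TimePattern/PeriodCollection keys are far more general).
const SimpleTimePattern = Dict{UnitRange{Int},Float64}

# Sample the pattern at a timestamp; periods not covered default to zero,
# as discussed below.
function sample(tp::SimpleTimePattern, t::DateTime)
    h = hour(t)
    for (period, value) in tp
        h in period && return value
    end
    return 0.0
end

# TimeSeries-TimePattern arithmetic: sample both at the TimeSeries timestamps
# and collect the results into a new TimeSeries.
function combine(op, ts::SimpleTimeSeries, tp::SimpleTimePattern)
    SimpleTimeSeries(ts.indexes, [op(v, sample(tp, t)) for (t, v) in zip(ts.indexes, ts.values)])
end

night_day = SimpleTimePattern(0:5 => 1.0, 6:23 => 2.0)
combine(*, a, night_day)  # reuses `a` from the previous sketch; yields a TimeSeries
```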

Similar to the TimeSeries-TimeSeries arithmetic, we might have to make assumptions about interpolating/extrapolating the TimePattern:

  • In the case of the TimePattern, though, it might make more sense to simply treat missing values as zeroes rather than to use the latest valid value.

  • What does "extrapolation" really even mean for repeating TimePatterns?

TimePattern-TimePattern arithmetic?

This one is a bit of a beast. The same basic principle of sampling both TimePatterns with a set of keys to perform the desired operation still stands, but obtaining the required set of keys can be really challenging (or at least confusing). I believe that the following might work for TimePattern-TimePattern arithmetic:

  1. Take every unique combination of keys from both TimePatterns to form the keys for the combined TimePattern.

  2. Sample values from both TimePatterns with the new keys to perform the desired operation.

However, I have no idea if step 1 is even possible with arbitrary TimePatterns. Regardless, the question of interpolation/extrapolation still stands from the previous sections.
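
For the simplified hour-range patterns from the earlier sketches, step 1 does reduce to taking pairwise intersections of the keys; whether something like this generalises to arbitrary PeriodCollections is exactly the open question above.

```julia
# Step 1 for the simplified hour-range patterns above: the combined keys are
# the non-empty pairwise intersections of the keys of both patterns.
function combine(op, tp1::SimpleTimePattern, tp2::SimpleTimePattern)
    result = SimpleTimePattern()
    for (p1, v1) in tp1, (p2, v2) in tp2
        overlap = max(first(p1), first(p2)):min(last(p1), last(p2))
        isempty(overlap) || (result[overlap] = op(v1, v2))
    end
    return result
end

peak = SimpleTimePattern(8:19 => 5.0, 0:7 => 1.0, 20:23 => 1.0)
combine(+, night_day, peak)  # keys 0:5, 6:7, 8:19, 20:23 (reusing `night_day` from above)
```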

TimePattern-TimePattern arithmetic is probably not on top of our priority list, but I included it here for completeness' sake.

Re-imagining time-dependent data:

To me, time-dependent data has three key properties:

  1. Time is continuous: Thus, time-dependent data imperfectly depicts something happening continuously through time. As such, time-dependent data should support accessing the data at any point in time, not just at the predefined indices. However, this requires additional information about how to handle time inside/outside the provided data points.

  2. Time can be sampled over time periods in addition to timestamps: However, how the values should be aggregated over a period depends on the nature of the data.

  3. Time can be expressed in annoyingly many different ways: An explicit ISO 8601 timestamp with timezone information is probably the only truly unambiguous way of expressing time, but often data comes in ambiguous or repeating yearly/monthly/weekly/daily patterns. Furthermore, time can also be expressed relative to a desired point in time, although this is less often the case.

As such, I feel like our current data types are missing key information about how to handle interpolation, extrapolation, and aggregation. Furthermore, we currently have the option to repeat a TimeSeries, which trespasses on TimePattern territory. The more I thought about it, the more TimeSeries and TimePattern started to blur together.

What defines time-dependent data?

To me, it would seem that generic time-dependent data requires the following "fields" of information in order to account for the three points I raised in the previous section:

  1. Timestamp-Value pairs: Stores the actual "data". However, in a generic case, we want to assume as little as possible about both the timestamps and the values. I'll discuss the possibilities of different timestamps in the next section.

  2. Interpolation instructions: How to sample data between the data points provided in point 1? Examples of different methods include:

    • Error: throws an error when any attempt to access data between the provided timestamps is made. Default?
    • Flat: uses the latest value. (is there ever a need for interpolation using the next value instead?)
    • Linear: linear interpolation between the latest and the next value.
    • Other?: There are probably a bazillion different methods for interpolation.
  3. Extrapolation instructions: How to sample data outside the data points provided in point 1? Examples include:

    • Error: throws an error when any attempt to access data outside the provided timestamps is made. Default?
    • Constant: uses a predefined constant value outside the data points.
    • Flat: uses the first value before the first valid timestamp, and the latest value after the last valid timestamp.
    • Line?: linear extrapolation based on the preceding slope ad infinitum.
    • Cyclic: reuses the data as if it repeats ad infinitum, starting from the beginning after the end and vice versa.
    • Mirror?: reuses the data, but starts backtracking from the end. (Very niche applications, but I've seen this.)
    • Others?
  4. Aggregation instructions: How to aggregate the data points when data is accessed with a time period instead of a timestamp?

    • Error: throws an error when attempted. Default?
    • Mean: calculates the mean value over the time period.
    • Sum: calculates the total accumulated value over the time period.
    • Min/Max: returns the min/max value over the time period. (Can be useful for e.g. reserves?)
    • Others?

As far as I can tell, time-dependent data with this information attached to it can handle pretty much any meaningful way to access it.

However, aggregation combined with interpolation and extrapolation can become a nightmare, technically requiring numerical integration over potentially complicated interpolation/extrapolation methods. For the simpler methods, numerical integration can be avoided.
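
As a rough illustration, here's what such a container could look like in Julia, with the instructions stored as plain Symbols. All of the names are made up, only a couple of the listed interpolation/extrapolation/aggregation methods are implemented, and the aggregation simply samples the period at a fixed resolution rather than integrating properly.

```julia
using Dates, Statistics

# Rough sketch of a generic time-dependent data container carrying explicit
# interpolation, extrapolation, and aggregation instructions.
struct TimeData
    indexes::Vector{DateTime}
    values::Vector{Float64}
    interpolate::Symbol   # :error, :flat, :linear, ...
    extrapolate::Symbol   # :error, :constant, :flat, :cyclic, :mirror, ...
    aggregate::Symbol     # :error, :mean, :sum, :min, :max, ...
end

# Point access: dispatch on the interpolation/extrapolation instructions.
# Only :flat and :linear interpolation and :flat extrapolation are sketched;
# everything else falls back to an error.
function (td::TimeData)(t::DateTime)
    if t < first(td.indexes) || t > last(td.indexes)
        td.extrapolate == :flat || error("$t outside data, extrapolation is $(td.extrapolate)")
        return t < first(td.indexes) ? first(td.values) : last(td.values)
    end
    i = searchsortedlast(td.indexes, t)
    t == td.indexes[i] && return td.values[i]
    td.interpolate == :flat && return td.values[i]
    td.interpolate == :linear || error("$t not a data point, interpolation is $(td.interpolate)")
    w = (t - td.indexes[i]) / (td.indexes[i + 1] - td.indexes[i])
    return (1 - w) * td.values[i] + w * td.values[i + 1]
end

# Period access: sample the period and apply the aggregation instruction.
function (td::TimeData)(from::DateTime, to::DateTime; step=Hour(1))
    samples = [td(t) for t in from:step:to]
    td.aggregate == :mean && return mean(samples)
    td.aggregate == :sum && return sum(samples)
    error("aggregation $(td.aggregate) not implemented in this sketch")
end

demand = TimeData([DateTime(2021, 1, 1, 0), DateTime(2021, 1, 1, 12)], [10.0, 20.0],
                  :linear, :flat, :mean)
demand(DateTime(2021, 1, 1, 6))                             # 15.0 via linear interpolation
demand(DateTime(2021, 1, 1, 0), DateTime(2021, 1, 1, 12))   # mean over the period
```

A proper implementation would of course integrate over the interpolated curve when aggregating, as noted above, instead of sampling the period at a fixed resolution.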

TimeSeries, TimePatterns, and the nature of timestamps...

As I mentioned earlier, the full ISO 8601 timestamp with time-zone information is truly unambiguous when it comes to time, whereas everything else is ambiguous in at least some way. However, data rarely comes with time-zone information attached, and often the time zone is irrelevant anyway. Thus, for as generic a representation of time-dependent data as possible, we should be able to handle even ambiguous timestamps.

This is where the differences between TimeSeries and TimePatterns start to blur. Well, technically the difference already starts to blur with extrapolation methods like Cyclic and Mirror, explained in the previous section. E.g. an hourly pattern repeating each day could be defined as a day-long hourly TimeSeries with a Cyclic extrapolation method. However, TimePatterns can handle more: e.g. profiles repeating on certain weekdays or on the first day of each month.

But here's the thing: both TimePatterns and TimeSeries can be defined using the same fields presented in the previous section, provided that a certain ambiguity is permitted in the timestamps. E.g. by using timestamps without year/month information, one could define a profile for every day of each month; the missing year/month values would be interpreted more or less as "wildcards". Similarly, since ISO 8601 supports a week number and weekday notation as well, one could define profiles for specific weekdays regardless of year/month/day.

When looking at things like this, a TimeSeries is nothing more than a less-ambiguous TimePattern.
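
As a toy illustration of the "wildcard" idea: an ambiguous timestamp could be a struct where any omitted field matches everything. `AmbiguousStamp` and `matches` below are made-up names, and the week/weekday notation mentioned above is left out for brevity.

```julia
using Dates

# Hypothetical "ambiguous timestamp": any field left as `nothing` is a wildcard.
struct AmbiguousStamp
    year::Union{Int,Nothing}
    month::Union{Int,Nothing}
    day::Union{Int,Nothing}
    hour::Union{Int,Nothing}
end

# A concrete DateTime matches a stamp if every non-wildcard field agrees.
function matches(stamp::AmbiguousStamp, t::DateTime)
    fields = ((stamp.year, year), (stamp.month, month), (stamp.day, day), (stamp.hour, hour))
    all(f === nothing || f == getter(t) for (f, getter) in fields)
end

# A fully specified stamp behaves like a TimeSeries index...
matches(AmbiguousStamp(2021, 7, 19, 12), DateTime(2021, 7, 19, 12))               # true
# ...while wildcarding year/month/day gives a daily repeating pattern.
matches(AmbiguousStamp(nothing, nothing, nothing, 12), DateTime(2021, 1, 1, 12))  # true
matches(AmbiguousStamp(nothing, nothing, nothing, 12), DateTime(2021, 1, 1, 13))  # false
```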

Issues with ambiguous timestamps and interpolation/extrapolation

Well, while it might sound nice on paper to have a generic time-dependent data container that can handle both the functionality of TimeSeries and TimePattern, I'm not sure if it's doable in practice.

  • Handling arbitrary ambiguous timestamps for arithmetic between time-dependent data objects like this could be a nightmare.

    • The algorithm for the arithmetic should still work, though: merge & sort the timestamps, and sample values for the desired operation. However, forming the set of necessary timestamps can get extremely tricky.
  • Caution has to be taken with interpreting ambiguous timestamps vs interpolation/extrapolation. E.g. if an hourly time series omits minute and second information, the ambiguous timestamp would interpret them as wildcards, so that the same value would be used regardless of minute/second information. However, the interpolation method could be set to e.g. Linear. Technically, there is no conflict, since the ambiguous timestamp covers the minute/second data in between and thus doesn't require interpolation. However, this might not be clear to users.