Unless I am very mistaken, there is a big problem with using this library for training a model to be used in production.
For example, if you use the `ta.add_all_ta_features(...)` function to create your ML dataset, you end up with some indicators calculated over their default window and others calculated over ALL the available data.
Here is an example with two dataframes containing exactly the same data, except that one is 1000 rows long and the other is 1001 rows long. Below are the results of `(df_last_1000.iloc[100:] - df_last_1001.iloc[100:]).sum()`. As you can see, the indicators that use ALL the data in their calculations differ, while those with a defined window do not:
open 0.000000e+00
high 0.000000e+00
low 0.000000e+00
close 0.000000e+00
volume 0.000000e+00
volume_adi 2.009133e+05
volume_obv -8.036533e+05
volume_cmf 0.000000e+00
volume_fi -1.443620e-03
volume_em 0.000000e+00
volume_sma_em 0.000000e+00
volume_vpt 0.000000e+00
volume_vwap 0.000000e+00
volume_mfi 0.000000e+00
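The same asymmetry can be reproduced without ta at all. Below is a minimal sketch using plain pandas, where `sma_14` stands in for a windowed indicator and `cum_return` for an unbounded one (the names and the synthetic data are illustrative, not part of ta):

```python
import numpy as np
import pandas as pd

# Synthetic close series standing in for real OHLCV data (illustrative only).
rng = np.random.default_rng(42)
close = pd.Series(100 + rng.normal(0, 1, 1001).cumsum())

def features(s: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({
        # Windowed indicator: each value depends only on the last 14 rows.
        "sma_14": s.rolling(14).mean(),
        # Unbounded indicator: each value depends on every row since the start.
        "cum_return": s / s.iloc[0] - 1,
    })

f_1001 = features(close)           # all 1001 rows
f_1000 = features(close.iloc[1:])  # same data minus the first row

# Compare the rows the two frames share (skipping the warm-up period).
diff = (f_1001.iloc[100:] - f_1000.iloc[100:]).abs().sum()
print(diff)
```

The windowed `sma_14` difference is essentially zero, while `cum_return` differs on every shared row, just like `volume_adi` and `volume_obv` above.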
The most important implication is that, when running a model trained on data generated by `ta.add_all_ta_features(...)`, the dataframe used to generate the feature vector for the production model would need to be EXACTLY the same length as the dataset used to train the model. If you trained your model on a significant time period of financial data, this constraint becomes impractical in production due to the calculation expense.
The workaround, of course, is to calculate the indicators for each row of your dataset by iterating over it and applying `ta.add_all_ta_features(...)` to a fixed number of preceding rows. However, this option should be part of the library. For example:
`ta.add_all_ta_features(df, max_window=100, ...)`
This would increase the expense of the operation, but it would ensure that in production you would know exactly how much data you need to calculate a suitable feature vector for your application.
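To make the workaround concrete, here is a rough sketch of what a fixed-window variant could look like (`add_features_fixed_window` and `max_window` are hypothetical, not part of the ta API): each row's features are computed from exactly `max_window` trailing rows, so even "unbounded" indicators see a fixed amount of history.

```python
import numpy as np
import pandas as pd

def add_features_fixed_window(df, feature_fn, max_window=100):
    """Hypothetical sketch of the proposed max_window behaviour (not the
    ta API): apply feature_fn to a sliding window of max_window rows and
    keep only the last row of each window."""
    rows = []
    for i in range(max_window, len(df) + 1):
        window = df.iloc[i - max_window:i]
        rows.append(feature_fn(window).iloc[-1])
    return pd.DataFrame(rows, index=df.index[max_window - 1:])

# Example with a cumulative-volume feature (normally "unbounded"):
rng = np.random.default_rng(0)
df = pd.DataFrame({"volume": rng.integers(1, 100, 300).astype(float)})
cum_vol = lambda d: pd.DataFrame({"cum_vol": d["volume"].cumsum()})

feats = add_features_fixed_window(df, cum_vol, max_window=100)
# A shorter frame ending on the same rows now yields identical values
# for the rows the two frames share:
feats_short = add_features_fixed_window(df.iloc[50:], cum_vol, max_window=100)
```

This is O(n · max_window) instead of O(n), which is the extra expense mentioned above, but the feature values no longer depend on how far back the dataframe happens to start.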
The second gotcha is that, given there is no defined window for a proportion of the indicators calculated by `ta.add_all_ta_features(...)`, it is very difficult to generate a suitable pair of train / test sets. If the function is applied BEFORE the split, information leaks across the train / test sets. If, on the other hand, it is applied AFTER the split, the values it generates will depend on how long the training set is compared with the test set. You could, of course, make the train / test sets the same length, but that is yet another constraint introduced by this issue.
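The leakage point can be illustrated with a stand-in unbounded feature (an expanding mean, purely illustrative, not a ta indicator): applying it before the split bakes training-set history into the test rows, while applying it after the split yields entirely different test-row values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
close = pd.Series(100 + rng.normal(0, 1, 500).cumsum())
split = 400

def expanding_mean(s):
    # Stand-in for an unbounded indicator: uses all rows since the start.
    return s.expanding().mean()

# Applied BEFORE the split: test rows incorporate training-set history.
before = expanding_mean(close).iloc[split:]

# Applied AFTER the split: test rows see only the test set itself.
after = expanding_mean(close.iloc[split:])

# The "same" feature disagrees on the test rows depending on when it
# was computed relative to the split.
print((before - after).abs().max())
```

Neither ordering is satisfactory: the first leaks, and the second makes the feature values a function of where you happened to cut the data.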
Perhaps I am missing something? If so, clarification would be much appreciated!