Incremental analysis period #350
-
Hey @HexhamAllstar, thanks for your kind words, happy the library is useful to you! Your question is definitely valid. Some functionality is already covered, some isn't. We already support storing/caching (fancy pickling) a calculator object. The main use case here is to avoid re-fitting every time, as fitting might be quite expensive due to large amounts of reference data. So you can fit once and then use our store afterwards. See the following example or the docs.

import nannyml as nml
reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
column_names = ['car_value', 'salary_range', 'debt_to_income_ratio']
calc = nml.UnivariateDriftCalculator(
column_names=column_names,
timestamp_column_name='timestamp',
continuous_methods=['jensen_shannon'],
)
# Fit on reference data only once, or again whenever the reference set gets updated.
calc.fit(reference_df)
# save the fitted calculator
store = nml.io.store.FilesystemStore(root_path='/tmp/nml-cache')
store.store(calc, filename='univariate_drift_calc.pkl')
# Load the fitted calculator and use it on analysis data. This will return None if no stored calculator could be found.
loaded_calc = store.load(filename='univariate_drift_calc.pkl', as_type=nml.UnivariateDriftCalculator)
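# (A small sketch of my own, not from the original reply: handling the None case.)
# If nothing was stored yet, e.g. on a fresh run, fall back to fitting from scratch and store it.
if loaded_calc is None:
    calc.fit(reference_df)
    store.store(calc, filename='univariate_drift_calc.pkl')
    loaded_calc = calc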
results = loaded_calc.calculate(analysis_df)

Then, on to the second part: analysis data. You are very correct, there is currently no good way of avoiding recalculations. In order to combine previously calculated results, you could do something along these lines:

import copy
import pandas as pd

# Assume old_result is a previously calculated Result and new_result is the
# Result calculated on the new analysis data.
# First, combine the rows of data. There should be no duplicates in here.
# You'll probably also have to deal with some nasty indexing stuff.
combined_data = pd.concat([old_result.data, new_result.data], axis=0).reset_index(drop=True)
# Now make a deep copy of your latest Result instance so we have all properties set.
# Ideally you would check "compatibility" of your Result instances, i.e. all of those properties should be equal
combined_result = copy.deepcopy(new_result)
# Now manually set the combined data on our combined Result instance
combined_result.data = combined_data

This is a crude way of combining results, but it should work. This question has popped up before already, so we might do a quick first iteration of this feature. Does this help somewhat? I'd be happy to help if you give it a go yourself!
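To make the weekly-scheduled idea more concrete, here's a rough sketch of what an incremental job could look like, putting both parts together. This is just my own illustration, not an official API: run_incremental_update and RESULT_PATH are hypothetical names, and I'm persisting the combined Result with plain pickle, since the store shown above is documented here for calculator objects.

import copy
import os
import pickle

import nannyml as nml
import pandas as pd

RESULT_PATH = '/tmp/nml-cache/univariate_drift_result.pkl'  # hypothetical location

def run_incremental_update(new_analysis_df):
    # Load the calculator that was fitted once on reference data.
    store = nml.io.store.FilesystemStore(root_path='/tmp/nml-cache')
    calc = store.load(filename='univariate_drift_calc.pkl', as_type=nml.UnivariateDriftCalculator)

    # Calculate drift only on the new batch of analysis data.
    new_result = calc.calculate(new_analysis_df)

    # Combine with previously stored results, if any, using the crude concat approach from above.
    if os.path.exists(RESULT_PATH):
        with open(RESULT_PATH, 'rb') as f:
            old_result = pickle.load(f)
        combined_result = copy.deepcopy(new_result)
        combined_result.data = pd.concat(
            [old_result.data, new_result.data], axis=0
        ).reset_index(drop=True)
    else:
        combined_result = new_result

    # Persist the combined result for the next scheduled run.
    with open(RESULT_PATH, 'wb') as f:
        pickle.dump(combined_result, f)

    return combined_result

Each run then only needs the newest few days of analysis data in memory, while the persisted result covers all chunks processed so far, so plotting it should render the full history (assuming the concatenated data keeps the Result internals consistent).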
-
Loving the library so far, really great functionality for monitoring drift.
I had one question: my data size is very large and it's quite slow to calculate drift metrics on my reference and analysis periods. What I've done so far is fit the UnivariateDriftCalculator to my full test set, then loaded in a sample of production data as the analysis period, which works well, but that analysis period only covers a few days.
Is it possible for me to pickle the results I have, load a new analysis dataset that covers a different period, calculate the metrics for this new period and then combine the results objects?
This way, I'd only need to load a few days of analysis data into memory at once for the calculation, and I wouldn't have to recalculate metrics for periods I've already covered, but I could still plot graphs that cover all of the analysis chunks. I could then schedule something to perform this task, e.g. once a week, for continuous monitoring.
Sorry if this is already possible and I've missed it!