Incremental analysis period #350
-
Hey @HexhamAllstar, thanks for your kind words, happy the library is useful to you! Your question is definitely valid. Some functionality is already covered, some isn't. We already support storing/caching (fancy pickling) a calculator object. The main use case here is to avoid re-fitting every time, as fitting might be quite expensive due to large amounts of reference data. So you can fit once and then use our store afterwards. See the following example or the docs.

import nannyml as nml
reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()
column_names = ['car_value', 'salary_range', 'debt_to_income_ratio']
calc = nml.UnivariateDriftCalculator(
column_names=column_names,
timestamp_column_name='timestamp',
continuous_methods=['jensen_shannon'],
)
# Fit on reference data only once, or again whenever the reference set gets updated.
calc.fit(reference_df)
# save the fitted calculator
store = nml.io.store.FilesystemStore(root_path='/tmp/nml-cache')
store.store(calc, filename='univariate_drift_calc.pkl')
# Load the fitted calculator and use it on analysis data. This will return None if no stored calculator could be found.
loaded_calc = store.load(filename='univariate_drift_calc.pkl', as_type=nml.UnivariateDriftCalculator)
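# (A small sketch of my own, not from the original reply: handling the None case.)
# If nothing was stored yet, e.g. on a fresh run, fall back to fitting from scratch and store it.
if loaded_calc is None:
    calc.fit(reference_df)
    store.store(calc, filename='univariate_drift_calc.pkl')
    loaded_calc = calc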
results = loaded_calc.calculate(analysis_df)

Then, on to the second part: analysis data. You are very correct, there is currently no good way of avoiding recalculations. In order to combine previously calculated results, you could do something along these lines:

import copy
import pandas as pd

# Assume old_result is a previously calculated Result and new_result is the
# Result calculated on the new analysis data.
# First, combine the rows of data. There should be no duplicates in here.
# You'll probably also have to deal with some nasty indexing stuff.
combined_data = pd.concat([old_result.data, new_result.data], axis=0).reset_index(drop=True)
# Now make a deep copy of your latest Result instance so we have all properties set.
# Ideally you would check "compatibility" of your Result instances, i.e. all of those properties should be equal
combined_result = copy.deepcopy(new_result)
# Now manually set the combined data on our combined Result instance
combined_result.data = combined_data

This is a crude way of combining results, but it should work. This question has popped up before already, so we might do a quick first iteration of this feature. Does this help somewhat? I'd be happy to help if you give it a go yourself!
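To make the weekly-scheduled idea more concrete, here's a rough sketch of what an incremental job could look like, putting both parts together. This is just my own illustration, not an official API: run_incremental_update and RESULT_PATH are hypothetical names, and I'm persisting the combined Result with plain pickle, since the store shown above is documented here for calculator objects.

import copy
import os
import pickle

import nannyml as nml
import pandas as pd

RESULT_PATH = '/tmp/nml-cache/univariate_drift_result.pkl'  # hypothetical location

def run_incremental_update(new_analysis_df):
    # Load the calculator that was fitted once on reference data.
    store = nml.io.store.FilesystemStore(root_path='/tmp/nml-cache')
    calc = store.load(filename='univariate_drift_calc.pkl', as_type=nml.UnivariateDriftCalculator)

    # Calculate drift only on the new batch of analysis data.
    new_result = calc.calculate(new_analysis_df)

    # Combine with previously stored results, if any, using the crude concat approach from above.
    if os.path.exists(RESULT_PATH):
        with open(RESULT_PATH, 'rb') as f:
            old_result = pickle.load(f)
        combined_result = copy.deepcopy(new_result)
        combined_result.data = pd.concat(
            [old_result.data, new_result.data], axis=0
        ).reset_index(drop=True)
    else:
        combined_result = new_result

    # Persist the combined result for the next scheduled run.
    with open(RESULT_PATH, 'wb') as f:
        pickle.dump(combined_result, f)

    return combined_result

Each run then only needs the newest few days of analysis data in memory, while the persisted result covers all chunks processed so far, so plotting it should render the full history (assuming the concatenated data keeps the Result internals consistent).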
-
Loving the library so far, really great functionality for monitoring drift.
I had one question: my data size is very large and it's quite slow to calculate drift metrics on my reference and analysis periods. What I've done so far is fit the UnivariateDriftCalculator to my full test set, then loaded in a sample of production data as the analysis period, which works well, but that analysis period only covers a few days.
Is it possible for me to pickle the results I have, load a new analysis dataset that covers a different period, calculate the metrics for this new period and then combine the results objects?
This way, I'd only need to load a few days of analysis data into memory at once for the calculation, and I wouldn't have to recalculate metrics for periods I've already covered, but I could still plot graphs that cover all of the analysis chunks. I could then schedule something to perform this task, e.g. once a week, for continuous monitoring.
Sorry if this is already possible and I've missed it!