Adding automatic sorting and duplicate removal #118

xiki-tempula · 2021-03-27T13:48:37Z

Currently, preprocessing.subsampling.statistical_inefficiency rejects a data set if it is not sorted or contains duplicates.
I'm aware that @dotsdl is refactoring the subsampling module (#98).

I think automatic sorting and duplicate removal would make the ABFE workflow more tolerant of corrupted datasets.

I'm thinking of adding the sorting and duplicate removal to preprocessing.subsampling.statistical_inefficiency:

def statistical_inefficiency(df, series=None, lower=None, upper=None, step=None,
                             conservative=True, drop_duplicates=True, sort=True):
    """Subsample a DataFrame based on the calculated statistical inefficiency
    of a timeseries.

    If `series` is ``None``, then this function will behave the same as
    :func:`slicing`.

    Parameters
    ----------
    df : DataFrame
        DataFrame to subsample according to the statistical inefficiency of `series`.
    series : Series
        Series to use for calculating statistical inefficiency. If ``None``,
        no statistical inefficiency-based subsampling will be performed.
    lower : float
        Lower bound to pre-slice `series` data from.
    upper : float
        Upper bound to pre-slice `series` to (inclusive).
    step : int
        Step between `series` items to pre-slice by.
    conservative : bool
        If ``True``, use ``ceil(statistical_inefficiency)`` to slice the data in uniform
        intervals (the default). If ``False``, sample at non-uniform intervals to
        closely match the (fractional) statistical inefficiency, as implemented
        in :func:`pymbar.timeseries.subsampleCorrelatedData`.
    drop_duplicates : bool
        Drop rows with duplicated time values.
    sort : bool
        Sort the DataFrame based on the time column.
    """

Or would it be better to make this functionality part of the ABFE workflow?
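
To make the proposal concrete, here is a rough sketch of what the two new options could do before the subsampling itself runs. It is only an illustration under assumptions: `sort_and_drop_duplicates` is a hypothetical helper name, the data is assumed to carry `time` as an index level, and this is not necessarily how alchemlyb ends up implementing it.

```python
import pandas as pd


def sort_and_drop_duplicates(df, drop_duplicates=True, sort=True):
    """Hypothetical helper: clean a DataFrame indexed by a 'time' level.

    This is an illustrative sketch, not alchemlyb's implementation.
    """
    if drop_duplicates:
        # Keep only the first row seen for each repeated time value.
        times = df.index.get_level_values("time")
        df = df[~times.duplicated(keep="first")]
    if sort:
        # Order the rows by the 'time' index level.
        df = df.sort_index(level="time")
    return df


# Toy data: out of order, with time 10.0 recorded twice.
index = pd.MultiIndex.from_tuples(
    [(0.0, 0.0), (20.0, 0.0), (10.0, 0.0), (10.0, 0.0)],
    names=["time", "fep-lambda"],
)
df = pd.DataFrame({"dHdl": [1.0, 3.0, 2.0, 2.5]}, index=index)

cleaned = sort_and_drop_duplicates(df)
print(cleaned)
```

Dropping duplicates before sorting means the row that survives is the first one encountered in the original file order, which is one possible policy among several.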


orbeckst · 2021-04-05T16:57:26Z

@dotsdl can you please look? Is this duplicating some of your effort? Should we do a stop-gap fix here?


xiki-tempula mentioned this issue Mar 27, 2021

Add sort and remove duplication to statistical_inefficiency #119

Merged

orbeckst closed this as completed in #119 Apr 14, 2021

orbeckst pushed a commit that referenced this issue Apr 14, 2021

Add sort and remove duplication to statistical_inefficiency (#119)

950b591

* fix #118
* set default to false
* Update subsampling.py
* bump coverage
* update test
