Adding automatic sorting and duplicate removal #118

Closed
xiki-tempula opened this issue Mar 27, 2021 · 1 comment · Fixed by #119

xiki-tempula (Collaborator) commented Mar 27, 2021

Currently, preprocessing.subsampling.statistical_inefficiency rejects a data set if it is not sorted or contains duplicates.
I'm aware that @dotsdl is refactoring the subsampling module in #98.

I think automatic sorting and duplicate removal would make the ABFE workflow more tolerant of corrupted data sets.

I'm thinking of adding the sorting and duplicate removal to preprocessing.subsampling.statistical_inefficiency:

def statistical_inefficiency(df, series=None, lower=None, upper=None, step=None,
                             conservative=True, drop_duplicates=True, sort=True):
    """Subsample a DataFrame based on the calculated statistical inefficiency
    of a timeseries.

    If `series` is ``None``, then this function will behave the same as
    :func:`slicing`.

    Parameters
    ----------
    df : DataFrame
        DataFrame to subsample according to the statistical inefficiency of `series`.
    series : Series
        Series to use for calculating statistical inefficiency. If ``None``,
        no statistical inefficiency-based subsampling will be performed.
    lower : float
        Lower bound to pre-slice `series` data from.
    upper : float
        Upper bound to pre-slice `series` to (inclusive).
    step : int
        Step between `series` items to pre-slice by.
    conservative : bool
        ``True`` use ``ceil(statistical_inefficiency)`` to slice the data in uniform
        intervals (the default). ``False`` will sample at non-uniform intervals to
        closely match the (fractional) statistical_inefficiency, as implemented
        in :func:`pymbar.timeseries.subsampleCorrelatedData`.
    drop_duplicates : bool
        Drop duplicated rows based on time.
    sort : bool
        Sort the DataFrame based on the time column.
    """

Or would it be better to make this a feature of the ABFE workflow?
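
For illustration, here is a minimal sketch of what the proposed `drop_duplicates` and `sort` flags might do before subsampling. The helper name `sort_and_deduplicate` is hypothetical, and the sketch assumes the time values live in an index level named `time`; it is not the code that was merged.

```python
import pandas as pd


def sort_and_deduplicate(df, drop_duplicates=True, sort=True):
    """Hypothetical pre-cleaning step (not part of the library).

    Assumes `df` carries its time values in an index level named "time".
    """
    if drop_duplicates:
        # Keep only the first row seen for each time value.
        times = df.index.get_level_values("time")
        df = df.loc[~times.duplicated(keep="first")]
    if sort:
        # Order the rows by increasing time.
        df = df.sort_index(level="time")
    return df


# Small example: unsorted data with one repeated time frame.
index = pd.MultiIndex.from_tuples(
    [(0.0, 0.0), (20.0, 0.0), (10.0, 0.0), (10.0, 0.0)],
    names=["time", "fep-lambda"],
)
raw = pd.DataFrame({"dE": [1.0, 3.0, 2.0, 2.0]}, index=index)
clean = sort_and_deduplicate(raw)
```

Keeping the first occurrence of each time value is one possible policy; whether duplicates should be dropped silently or reported is a design choice left open here.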

orbeckst (Member) commented Apr 5, 2021

@dotsdl can you please look? Is this duplicating some of your effort? Should we do a stop-gap fix here?

orbeckst pushed a commit that referenced this issue Apr 14, 2021
* fix #118
* set default to false
* Update subsampling.py
* bump coverage
* update test