Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add audeer.unique() #149

Merged
merged 4 commits into from
Jul 5, 2024
Merged

Add audeer.unique() #149

merged 4 commits into from
Jul 5, 2024

Conversation

hagenw
Copy link
Member

@hagenw hagenw commented Jul 5, 2024

It might be that a user wants to get unique values of a sequence, in the order they are appearing in that sequence.

image

Copy link

codecov bot commented Jul 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (ec72190) to head (e108c98).
Report is 2 commits behind head on main.

Additional details and impacted files
Files Coverage Δ
audeer/__init__.py 100.0% <100.0%> (ø)
audeer/core/utils.py 100.0% <100.0%> (ø)

... and 1 file with indirect coverage changes

@hagenw hagenw requested a review from ChristianGeng July 5, 2024 08:05
Copy link
Member

@ChristianGeng ChristianGeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether the implicit type conversion as list is always desired?

import pandas as pd
import audeer

strlist = audeer.unique("aabbaa")
result = audeer.unique(strlist)
print(result)
>>> ['a', 'b']
ser = pd.Series([1,2,1,2,4,1])
result = audeer.unique(ser)
print(result)
>>> [1, 2, 4]

In order to return the original type, one would have to call "'.join(strlist) on the string input and pd.Series on the series. In both cases the union works but has an implicit type conversion. It that is desired this is ok.

A second point is the typing:

But in my superficial understanding, typing.Sequence covers everything that implements __iter__. So for example a pandas df (I have not checked it though) would fall in that category as well.

So would it be more precise if one would have a kind of union type for the input type hint, covering (at least) str, tuple, list and pd.Series?

@hagenw
Copy link
Member Author

hagenw commented Jul 5, 2024

typing.Sequence should be fine as the code will work with every object that implements __iter__.

But to support that, it is usually easier to convert everything to a fixed output type, instead of returning the original type. I also prefer to have a fixed output type in basic functions instead of defining the output type by the input type.

@ChristianGeng
Copy link
Member

typing.Sequence should be fine as the code will work with every object that implements __iter__.

But to support that, it is usually easier to convert everything to a fixed output type, instead of returning the original type. I also prefer to have a fixed output type in basic functions instead of defining the output type by the input type.

I see: a DataFrame is not a sequence:

import pandas as pd
import collections.abc
df = pd.DataFrame.from_dict({
    'v1': [1,2,1,2,4,1],
    'v2': [2,4,4,2,3,1],
    }
)
typecheck = isinstance(df, collections.abc.Sequence)
print(typecheck)
print(df.__iter__)

results in

False
<bound method NDFrame.__iter__ of    v1  v2
0   1   2
1   2   4
2   1   4
3   2   2
4   4   3
5   1   1>

In other words: pd.Dataframe implements __iter__ but the object is not a sequence.
Interestingly, a pd.Series is not a sequence either. Still unique works on it.

ser = pd.Series([1,2,1,2,4,1])
isinstance(ser, collections.abc.Sequence)
>>> False

But I am happy with it and will approve it as this becomes to academic.
I think one should add a string to the test case and then I will approve.

@ChristianGeng
Copy link
Member

I accidentally pressed the "close" button next to the comment button. Reopening. Apologies.

@ChristianGeng ChristianGeng reopened this Jul 5, 2024
Copy link
Member

@ChristianGeng ChristianGeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All question have been discussed.
Possibly adding a string to the test case would be nice in order to detail the behavior on strings.

I will not be fussy about it and approve though.

@hagenw
Copy link
Member Author

hagenw commented Jul 5, 2024

In the current Python documentation, they write:

image

and we have

>>> isinstance(pd.Series([1]), collections.abc.Iterable)
True

So I changed the type to typing.Iterable, which seems to be a better fit to me. Do you agree?

@ChristianGeng
Copy link
Member

So I changed the type to typing.Iterable, which seems to be a better fit to me. Do you agree?

Yes. A DataFrame is also an iterable though. I have approved already - it is probably hard to impossible to completely get it right with the python typing system: when one goes through the tests one can see what is going on. Alright with me.

@hagenw hagenw merged commit 7b1d7bc into main Jul 5, 2024
19 checks passed
@hagenw hagenw deleted the unique branch July 5, 2024 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants