Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Feature Constructor: Easier categorical features, allow creation of date/time, easier use of string data #3936

Merged
merged 5 commits into from
Aug 2, 2019

Conversation

janezd
Copy link
Contributor

@janezd janezd commented Jul 11, 2019

Fixes #2900. Fixes #3056.

  • Lets the expression for discrete variables return strings instead of indices.

    If the user defines values, the function must return integer indices. If function returns strings (or instances of Value), the user mustn't define values. It is contrived, but you can't have both. We will have to mention this in documentation. :) If you have a better solution, feel free to change the code.

  • Casts string values into strings, so name[0] is now a valid expression for zoo. Before, you'd have to write str(name)[0] -- and, of course, list all possible values.

  • Allows creation of Date / Time variables. If expression returns a string, it is parsed as date/time.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Jul 11, 2019

Codecov Report

Merging #3936 into master will decrease coverage by 0.24%.
The diff coverage is 69.38%.

@@            Coverage Diff            @@
##           master   #3936      +/-   ##
=========================================
- Coverage   84.95%   84.7%   -0.25%     
=========================================
  Files         376     371       -5     
  Lines       65907   64957     -950     
=========================================
- Hits        55988   55022     -966     
- Misses       9919    9935      +16

@ajdapretnar
Copy link
Contributor

The idea is the right one, but it doesn't seem to work. When I create a datetime variable from string, which returns '2019-01-01' for example, the widget says ValueError: Could not convert string to float: '2019-01-01'. There is a special way of creating datetime variables that has to be considered here.

@janezd
Copy link
Contributor Author

janezd commented Jul 12, 2019

When I create a datetime variable from string, which returns '2019-01-01' for example, the widget says ValueError: Could not convert string to float: '2019-01-01'.

That's why I asked whether it's useful - because I suspected it's not. :) The function is expected to return a float. So we should wrap the expression into a function that checks whether the result is a string and in this case converts it into a float? I'll see.

@janezd janezd force-pushed the featureconstructor-datetime branch from 9f99cb4 to f0516f9 Compare July 12, 2019 12:56
@janezd janezd changed the title [ENH] Allow creation of Date / Time variables in Feature Constructor [ENH] Feature Constructor: Easier categorical features, allow creation of date/time, easier use of string data Jul 12, 2019
@lanzagar lanzagar self-assigned this Jul 19, 2019
@ajdapretnar
Copy link
Contributor

Testing briefly, this is really cool. 👍

)
placeholderText="Expression...",
toolTip="Result must be a number for numeric variables, "
"and strings for text variables.\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"and a string for text and date/time variables.\n" ?
EDIT: Actually, I see that for datetime vars both are options as well - number or string. Although currently, the number must be a float and not an int, which is a bit awkward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be even better if subclasses would replace this tooltip with just the information specific to that variable type?

Copy link
Contributor Author

@janezd janezd Jul 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agreed, implemented it, and no longer agree. It looks visually as the same line edit and the user who will get a tooltip once (and the default is that for numeric attributes, which tells the user to enter a numeric expression) will not expect to get a different tooltip the next time.

Please check and tell me that I can/should remove ac97c90. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is not perfect, but I think removing ac97c90 is worse... The best thing about the tooltip for datetime is that you were able to put in an example, so people actually know what they have to produce in the expression. If those 3 lines are added to the original 4 of the tooltip it becomes a mess.
And anyway, I think the problem would be minimal if when no variable has yet been selected it showed something generic or even no tooltip.
But until the next round of improvements, I would say let's just keep this commit. People not expecting a different tooltip is probably not that big a problem, because you usually need to check it when you have an error in the expression and then you see the relevant information.

Orange/widgets/data/tests/test_owfeatureconstructor.py Outdated Show resolved Hide resolved
self.valuesedit = QLineEdit()
tooltip = \
"If values are given, expression must return integer indices.\n" \
"Otherwise, expression must return strings."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the third option (using names of values instead of indices or strings) is intentionally not mentioned and is just a bonus hack for power users?
Since it is basically the same as using strings (minus the quotes), we could consider dropping the unmentioned 3rd option and make this tooltip true...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gladly. I suppose this was so exotic that we don't care about backward compatibility?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can ignore it in this case (not sure it was documented at any time anyway).

parse = Orange.data.TimeVariable("_").parse

def cast(e): # pylint: disable=function-redefined
if isinstance(e, float):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is a niche case, if we are supporting numbers I don't see a reason why 1000. (float) would be accepted and 1000 (int) not...
Maybe we should write the documentation together with these changes, so that we are sure what should work and what should not. It adds a nice incentive to make it simple enough that explaining it is not too hard :)

try:
return parse(e)
except: # pylint: disable=bare-except
return nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows user mistakes to pass too silently in my opinion. I would expect an error in the widget to be shown if I return a date "9/11/2001", "2019-01-33" or "2019-21-12" etc. Currently it could be that some dates work, but the incorrect just become missing values without any notice.
The documentation should also definitely link to https://en.wikipedia.org/wiki/ISO_8601 since most people don't actually use the iso standard for dates and probably don't know what they should look like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as a part of this PR, but it would be soooo nice if Orange worked with more than one datetime format. Especially when importing from Excel, which pretends to be smart and formats dates in its own way, this would make it a lot easier for people to import datetimes. I assume we can make a simple function that checks the most likely format and applies it to the datetime column... Or actually converts it into ISO, because I suppose nothing would work in the Timeseries add-on otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forget why I wanted to catch exceptions here. Removed.

Accepting different date formats indeed belongs to a separate issue.

@codecov-io
Copy link

codecov-io commented Jul 30, 2019

Codecov Report

Merging #3936 into master will increase coverage by <.01%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##           master    #3936      +/-   ##
==========================================
+ Coverage   85.12%   85.13%   +<.01%     
==========================================
  Files         378      378              
  Lines       67117    67194      +77     
==========================================
+ Hits        57134    57204      +70     
- Misses       9983     9990       +7

@janezd janezd force-pushed the featureconstructor-datetime branch from f0516f9 to 09bfee3 Compare July 30, 2019 15:51
@janezd janezd force-pushed the featureconstructor-datetime branch from 09bfee3 to ac97c90 Compare July 30, 2019 16:06
@lanzagar lanzagar merged commit a8f3d78 into biolab:master Aug 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Constructor: easier categorical features Feature Constructor: Option to create Datetime variable
4 participants